Abstract

We study learnability in the online learning model. We define several complexity measures which cap- 
ture the difficulty of learning in a sequential manner. Among these measures are analogues of Rademacher 
complexity, covering numbers and fat shattering dimension from statistical learning theory. Relationships
among these complexity measures, their connection to online learning, and tools for bounding them are
provided. In the setting of supervised learning, finiteness of the introduced scale-sensitive parameters is 
shown to be equivalent to learnability. The complexities we define also ensure uniform convergence for 
non-i.i.d. data, extending the uniform Glivenko-Cantelli type results. We conclude by showing online 
learnability for an array of examples. 

1 Introduction 

In the online learning framework, the learner is faced with a sequence of data appearing at discrete time 
intervals. In contrast to the classical "batch" learning scenario where the learner is being evaluated after the 
sequence is completely revealed, in the online framework the learner is evaluated at every round. Furthermore, 
in the batch scenario the data source is typically assumed to be i.i.d. with an unknown distribution, while 
in the online framework we relax or eliminate any stochastic assumptions on the data source. As such, the 
online learning problem can be phrased as a repeated two-player game between the learner (player) and the 
adversary (Nature). 

Let $\mathcal{F}$ be a class of functions and $\mathcal{X}$ some set. The Online Learning Model is defined as the following
$T$-round interaction between the learner and the adversary: On round $t = 1, \ldots, T$, the learner chooses
$f_t \in \mathcal{F}$, the adversary picks $x_t \in \mathcal{X}$, and the learner suffers loss $f_t(x_t)$. At the end of $T$ rounds we define
regret

\[ \mathbf{R}(f_{1:T}, x_{1:T}) = \sum_{t=1}^{T} f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \]

as the difference between the cumulative loss of the player and the cumulative loss of the best fixed
comparator. For the given pair $(\mathcal{F}, \mathcal{X})$, the problem is said to be online learnable if there exists an algorithm
for the learner such that regret grows sublinearly. Learnability is closely related to Hannan consistency 

mm- 
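The regret defined above can be computed directly once the moves of both players are fixed. The following is a minimal sketch, assuming a finite comparator class so that the infimum becomes a minimum; the functions `f_a`, `f_b` and the sequence `xs` are hypothetical toy data, not taken from the paper.

```python
def regret(played, xs, comparators):
    """Cumulative loss of the played functions minus that of the best fixed comparator.

    Assumes `comparators` is a finite list, so the infimum is a minimum.
    """
    learner_loss = sum(f(x) for f, x in zip(played, xs))
    best_fixed = min(sum(f(x) for x in xs) for f in comparators)
    return learner_loss - best_fixed


# Hypothetical toy class on X = {0, 1}.
f_a = lambda x: float(x)        # loss 1 when x == 1
f_b = lambda x: 1.0 - float(x)  # loss 1 when x == 0

xs = [1, 1, 0, 1]               # adversary's moves
played = [f_a, f_b, f_b, f_a]   # learner's moves
toy_regret = regret(played, xs, [f_a, f_b])
```

Here the learner accumulates loss 3 while the best fixed comparator `f_b` accumulates loss 1, so the regret is 2.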

There has been a lot of interest in a particular setting of the online learning model, called online convex
optimization. In this setting, we write $x_t(f_t)$ as the loss incurred by the learner, and the assumption is made
that the function $x_t$ is convex in its argument. The particular convexity structure enables the development
of optimization-based algorithms for learner's choices. Learnability and precise rates of growth of regret have 
been shown in a number of recent papers (e.g. [121 1331 13 H])- 

The online learning model also subsumes the prediction setting. In the latter, the learner's choice of a
$\mathcal{Y}$-valued function $g_t$ leads to the loss of $\ell(g_t(z_t), y_t)$ according to a fixed loss function
$\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. It is evident that the choice of the learner is equivalently written as
$f_t(x) = \ell(g_t(z), y)$, and $x_t = (z_t, y_t)$ is the choice of the adversary. In Section 7 we discuss the
prediction setting in more detail.

In the "batch" learning scenario, data $\{(x_i, y_i)\}_{i=1}^{T}$ is presented as an i.i.d. draw from a fixed distribution
over some product $\mathcal{X} \times \mathcal{Y}$. Learnability results have been extensively studied in the PAC framework [33 and
its agnostic extensions [THIIID]. It is well-known that learnability in the binary case (that is, $\mathcal{Y} = \{-1, +1\}$)
is completely characterized by finiteness of the Vapnik-Chervonenkis combinatorial dimension of the function
class [42, 40]. In the real-valued case, a number of combinatorial quantities have been proposed: P-dimension
[29], V-dimension [41], as well as the scale-sensitive versions $P_\gamma$-dimension [20, 6] and $V_\gamma$-dimension [3]. The
last two dimensions were shown to be characterizing learnability [3] and uniform convergence of means to
expectations for function classes.

In contrast to the classical learning setting, there has been surprisingly little work on characterizing learn- 
ability for the online learning framework. Littlestone [23 has shown that, in the setting of prediction of 
binary outcomes, a certain combinatorial property of the binary-valued function class characterizes learn- 
ability in the realizable case (that is, when the outcomes presented by the adversary are given according to 
some function in the class $\mathcal{F}$). The result has been extended to the non-realizable case by Shai Ben-David,
David Pal and Shai Shalev-Shwartz [8] who named this combinatorial quantity the Littlestone's dimension.
Coincident with [8] , minimax analysis of online convex optimization yielded new insights into the value of 
the game, its minimax dual representation, as well as algorithm-independent upper and lower bounds [H I36| . 
In this paper, we build upon these results and the findings of [5J to develop a theory of online learning. 

We show that in the online learning model, a notion which we call Sequential Rademacher complexity allows 
us to easily prove learnability for a vast array of problems. The role of this complexity is similar to the role of 
the Rademacher complexity in statistical learning theory. Next, we extend Littlestone's dimension to the
real-valued case. We show that finiteness of this scale-sensitive version, which we call the fat-shattering dimension,
is necessary and sufficient for learnability in the prediction setting. Extending the binary-valued result of 
[8], we introduce a generic algorithm which plays the role similar to that of empirical risk minimization 
for i.i.d. data: if the problem is learnable in the supervised setting, then it is learnable by this algorithm. 
Along the way we develop analogues of Massart's finite class lemma, the Dudley integral upper bound on 
the Sequential Rademacher complexity, appropriately defined packing and covering numbers, and even an 
analogue of the Sauer-Shelah combinatorial lemma. We also introduce a generalization of the uniform law 
of large numbers for non-i.i.d. distributions and show that finiteness of the fat-shattering dimension implies 
this convergence. 

Many of the results come with more work than their counterparts in statistical learning theory. In particular, 
instead of training sets we have to work with trees, making the results somewhat involved. While the spirit 
of the online theory is that it provides a "temporal" generalization of the "batch" learning problem, not all 
the results from statistical learning theory transfer to our setting. For instance, two distinct notions of a 
packing set exist for trees, and these notions can be seen to coincide in "batch" learning. The fact that many 
notions of statistical learning theory can be extended to the online learning model is indeed remarkable. 

2 Preliminaries



By phrasing the online learning model as a repeated game and considering its minimax value, we naturally 
arrive at an important object: trees. Unless specified, all trees considered in this paper are rooted binary 
trees with equal-depth paths from the root to the leaves. While it is useful to have the tree picture in mind 
when reading the paper, it is also necessary to precisely define trees as mathematical objects. We opt for 
the following definition. 

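Definition 1 (Trees). Given some set $\mathcal{Z}$, a $\mathcal{Z}$-valued tree of depth $T$ is a sequence $(\mathbf{z}_1, \ldots, \mathbf{z}_T)$ of $T$ mappings
$\mathbf{z}_i : \{\pm 1\}^{i-1} \to \mathcal{Z}$. The root of the tree $\mathbf{z}$ is the constant function $\mathbf{z}_1 \in \mathcal{Z}$.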

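Armed with this definition, we can talk about various operations on trees. For a function $f : \mathcal{Z} \to \mathcal{U}$, $f(\mathbf{x})$
denotes the $\mathcal{U}$-valued tree defined by the mappings $(f \circ \mathbf{x}_1, \ldots, f \circ \mathbf{x}_T)$. Analogously, for $f : \mathcal{Z} \times \mathcal{Z} \to \mathcal{U}$, the
$\mathcal{U}$-valued tree $f(\mathbf{x}, \mathbf{x}')$ is defined as mappings $(f(\mathbf{x}_1, \mathbf{x}'_1), \ldots, f(\mathbf{x}_T, \mathbf{x}'_T))$. In particular, this defines the usual
binary arithmetic operations on real-valued trees. Furthermore, for a class of functions $\mathcal{F}$ and a tree $\mathbf{x}$, the
projection of $\mathcal{F}$ onto $\mathbf{x}$ is $\mathcal{F}(\mathbf{x}) = \{f(\mathbf{x}) : f \in \mathcal{F}\}$.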

Definition 2 (Path). A path of length $T$ is a sequence $\epsilon = (\epsilon_1, \ldots, \epsilon_{T-1}) \in \{\pm 1\}^{T-1}$.

We shall abuse notation by referring to $\mathbf{x}_i(\epsilon_1, \ldots, \epsilon_{i-1})$ by $\mathbf{x}_i(\epsilon)$. Clearly $\mathbf{x}_i$ only depends on the first $i-1$
elements of $\epsilon$. We will also refer to $\epsilon = (\epsilon_1, \ldots, \epsilon_T) \in \{\pm 1\}^T$ as a path in a tree of depth $T$ even though the
value of $\epsilon_T$ is inconsequential. Next we define the notion of subtrees.

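Definition 3 (Subtrees). The left subtree $\mathbf{z}^{\ell}$ of $\mathbf{z}$ at the root is defined as $T-1$ mappings $(\mathbf{z}^{\ell}_1, \ldots, \mathbf{z}^{\ell}_{T-1})$
with $\mathbf{z}^{\ell}_i(\epsilon) = \mathbf{z}_{i+1}(\{-1\} \times \epsilon)$ for $\epsilon \in \{\pm 1\}^{T-1}$. The right subtree $\mathbf{z}^{r}$ is defined analogously by conditioning
on the first coordinate of $\mathbf{z}_{i+1}$ to be $+1$.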

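Given two subtrees $\mathbf{z}$, $\mathbf{v}$ of the same depth $T-1$ and a constant mapping $\mathbf{w}_1$, we can join the two subtrees
to obtain a new set of mappings $(\mathbf{w}_1, \ldots, \mathbf{w}_T)$ as follows. The root is the constant mapping $\mathbf{w}_1$. For
$i \in \{2, \ldots, T\}$ and $\epsilon \in \{\pm 1\}^T$, $\mathbf{w}_i(\epsilon) = \mathbf{z}_{i-1}(\epsilon)$ if $\epsilon_1 = -1$ and $\mathbf{w}_i(\epsilon) = \mathbf{v}_{i-1}(\epsilon)$ if $\epsilon_1 = +1$.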
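The tree operations above translate directly into a small data structure. The sketch below is one possible encoding, assumed here for illustration: a tree of depth $T$ is stored as a list of $T$ dictionaries, where level $i$ maps a sign prefix of length $i$ to a value. The helper names (`make_tree`, `evaluate`, `subtree`, `join`) are hypothetical.

```python
from itertools import product


def make_tree(depth, label):
    """Tree of depth T: level i maps each sign prefix of length i to label(prefix)."""
    return [{p: label(p) for p in product((-1, 1), repeat=i)} for i in range(depth)]


def evaluate(tree, path):
    """Values along a path; the i-th mapping reads only the signs before position i."""
    return [level[tuple(path[:i])] for i, level in enumerate(tree)]


def subtree(tree, first_sign):
    """Left (first_sign = -1) or right (first_sign = +1) subtree at the root."""
    return [{p[1:]: v for p, v in level.items() if p[0] == first_sign}
            for level in tree[1:]]


def join(root_value, left, right):
    """Join two trees of equal depth under a new constant root."""
    levels = [{(): root_value}]
    for lv, rv in zip(left, right):
        level = {(-1,) + p: v for p, v in lv.items()}
        level.update({(1,) + p: v for p, v in rv.items()})
        levels.append(level)
    return levels


# Label each node by the left/right moves leading to it.
tree = make_tree(3, lambda p: "".join("L" if s < 0 else "R" for s in p) or "root")
image = {v for level in tree for v in level.values()}  # Img of the tree
```

Splitting a tree at the root and joining the two halves back under the same root value recovers the original tree, which is a quick sanity check on both operations.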

In the sequel, we will need to talk about the values given by the tree $\mathbf{x}$ over all the paths. Formally, let
$\mathrm{Img}(\mathbf{x}) = \mathbf{x}(\{\pm 1\}^T) = \{\mathbf{x}_t(\epsilon) : t \in [T], \epsilon \in \{\pm 1\}^T\}$ be the image of the mappings of $\mathbf{x}$.

Let us also introduce some notation not related to trees. We denote a sequence of the form $(y_a, \ldots, y_b)$,
where $a < b$, by simply writing $y_{a:b}$. The set of all functions from $\mathcal{X}$ to $\mathcal{Y}$ is denoted by $\mathcal{Y}^{\mathcal{X}}$, and the
$t$-fold product $\mathcal{X} \times \cdots \times \mathcal{X}$ is denoted by $\mathcal{X}^t$. For any $T \in \mathbb{N}$, $[T]$ denotes the set $\{1, \ldots, T\}$. A conditional
expectation is written as $\mathbb{E}_t[A] = \mathbb{E}[A \mid \mathcal{G}_t]$, where $\mathcal{G}_t$ is an appropriate filtration which will be specified.
Whenever a supremum (infimum) is written in the form $\sup_a$ without $a$ being quantified, it is assumed that
$a$ ranges over the set of all possible values which will be understood from the context.

For the sake of readability, almost all the proofs are deferred to the appendix. 

3 Value of the Game 

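Fix the sets $\mathcal{F}$ and $\mathcal{X}$ and consider the online learning model stated in the introduction. We assume that $\mathcal{F}$
is a subset of a separable metric space. Let $\mathcal{Q}$ be the set of probability measures on $\mathcal{F}$ and assume that $\mathcal{Q}$
is weakly compact. We consider randomized learners who predict a distribution $q_t \in \mathcal{Q}$ on every round. We
define the value of the game as

\[ \mathcal{V}_T(\mathcal{F}, \mathcal{X}) = \inf_{q_1 \in \mathcal{Q}} \sup_{x_1 \in \mathcal{X}} \mathbb{E}_{f_1 \sim q_1} \cdots \inf_{q_T \in \mathcal{Q}} \sup_{x_T \in \mathcal{X}} \mathbb{E}_{f_T \sim q_T} \left[ \sum_{t=1}^{T} f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right] \tag{1} \]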
where $f_t$ has distribution $q_t$. We consider here the adaptive adversary who gets to choose each $x_t$ based on
the history of moves $f_{1:t-1}$ and $x_{1:t-1}$.



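The above definition is stated in the extensive form, but can be equivalently written in a strategic form.
This is achieved by defining a learner's strategy $\pi$ as a sequence of mappings $\pi_t : \mathcal{X}^{t-1} \times \mathcal{F}^{t-1} \to \mathcal{Q}$ for each
$t \in [T]$, and analogously defining the adversarial strategy $\tau$ as a sequence of mappings $\tau_t : \mathcal{X}^{t-1} \times \mathcal{F}^{t-1} \to \mathcal{P}$. The value
can then be written as

\[ \mathcal{V}_T(\mathcal{F}, \mathcal{X}) = \inf_{\pi} \sup_{\tau} \mathbb{E} \left[ \sum_{t=1}^{T} f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right] \tag{2} \]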



where it is understood that each $f_t$ and $x_t$ are successively drawn from the corresponding $\pi_t$ and $\tau_t$ given
the history $f_{1:t-1}, x_{1:t-1}$. While the strategic notation is more succinct, it hides the important sequential
structure of the problem. This is the reason why we opt for the more explicit, yet more cumbersome, 
extensive form. 

The first key step is to appeal to the minimax theorem and exchange the pairs of infima and suprema in (1).
This dual formulation is easier to analyze because the choice of the player comes after the choice of the mixed
strategy of the adversary. We remark that the minimax theorem holds under a very general assumption of
weak compactness of $\mathcal{Q}$. The assumptions on $\mathcal{F}$ that translate into weak compactness of $\mathcal{Q}$ are discussed in
Appendix B. Compactness under weak topology allows us to appeal to Theorem 1 stated below, which is
adapted for our needs from [T].

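Theorem 1. Let $\mathcal{F}$ and $\mathcal{X}$ be the sets of moves for the two players, satisfying the necessary conditions for
the minimax theorem to hold. Denote by $\mathcal{Q}$ and $\mathcal{P}$ the sets of probability measures (mixed strategies) on $\mathcal{F}$
and $\mathcal{X}$, respectively. Then

\begin{align*}
\mathcal{V}_T(\mathcal{F}, \mathcal{X}) &= \inf_{q_1} \sup_{x_1} \mathbb{E}_{f_1 \sim q_1} \cdots \inf_{q_T} \sup_{x_T} \mathbb{E}_{f_T \sim q_T} \left[ \sum_{t=1}^{T} f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right] \\
&= \sup_{p_1} \mathbb{E}_{x_1 \sim p_1} \cdots \sup_{p_T} \mathbb{E}_{x_T \sim p_T} \left[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right]. \tag{3}
\end{align*}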



The question of learnability in the online learning model is now reduced to the study of $\mathcal{V}_T(\mathcal{F}, \mathcal{X})$, taking
Eq. (3) as the starting point. In particular, under our definition, showing that the value grows sublinearly
with $T$ is equivalent to showing learnability.

Definition 4. A class $\mathcal{F}$ is said to be online learnable with respect to the given $\mathcal{X}$ if

\[ \limsup_{T \to \infty} \frac{\mathcal{V}_T(\mathcal{F}, \mathcal{X})}{T} = 0. \]



The rest of the paper is aimed at understanding the value of the game $\mathcal{V}_T(\mathcal{F}, \mathcal{X})$ for various function classes
$\mathcal{F}$. Since complexity of $\mathcal{F}$ is the focus of the paper, we shall often write $\mathcal{V}_T(\mathcal{F})$, and the dependence on $\mathcal{X}$
will be implicit.

One of the key notions introduced in this paper is the complexity which we term Sequential Rademacher 
complexity. A natural generalization of Rademacher complexity [22l H El], the sequential analogue possesses
many of the nice properties of its classical cousin. The properties are proved in Section 8 and then used to
show learnability for many of the examples in Section 9. The first step, however, is to show that Sequential
Rademacher complexity upper bounds the value of the game. This is the subject of the next section. 

4 Random Averages



We propose the following definition. The key difference from the classical notion is the dependence of the 
sequence of data on the sequence of signs (Rademacher random variables). As shown in the sequel, this 
dependence captures the sequential nature of the problem. 

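Definition 5. The Sequential Rademacher Complexity of a function class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ is defined as

\[ \mathfrak{R}_T(\mathcal{F}) = \sup_{\mathbf{x}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right] \]

where the outer supremum is taken over all $\mathcal{X}$-valued trees of depth $T$ and $\epsilon = (\epsilon_1, \ldots, \epsilon_T)$ is a sequence of
i.i.d. Rademacher random variables.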
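For a finite class and a fixed tree, the inner expectation in Definition 5 can be evaluated exactly by enumerating all sign sequences. The sketch below does this under assumed conventions: a tree is a list of dictionaries keyed by sign prefixes, and the class is the hypothetical set of all $\{\pm 1\}$-valued functions on $\{0, 1, 2\}$. It evaluates one tree at a time and does not take the outer supremum over trees.

```python
from itertools import product


def tree_rademacher_average(tree, functions):
    """E_eps[ max_f sum_t eps_t f(x_t(eps)) ] for one fixed tree, over all 2^T paths."""
    T = len(tree)
    total = 0.0
    for eps in product((-1, 1), repeat=T):
        xs = [tree[t][eps[:t]] for t in range(T)]  # x_t depends on eps_1..eps_{t-1}
        total += max(sum(e * f(x) for e, x in zip(eps, xs)) for f in functions)
    return total / 2 ** T


# Hypothetical class: all 8 sign-valued functions on X = {0, 1, 2}.
functions = [lambda x, v=v: v[x] for v in product((-1, 1), repeat=3)]

# A tree whose second-round point depends on the first sign ...
branching_tree = [{(): 0}, {(-1,): 1, (1,): 2}]
# ... versus a constant tree that queries the same point on both rounds.
constant_tree = [{(): 0}, {(-1,): 0, (1,): 0}]

branching_value = tree_rademacher_average(branching_tree, functions)
constant_value = tree_rademacher_average(constant_tree, functions)
```

On the branching tree some function matches the signs on every path, giving the maximal value $T = 2$; on the constant tree the value drops to 1. This illustrates how the dependence of the data on the signs can increase the average.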



In statistical learning, Rademacher complexity is shown to control uniform deviations of means and ex- 
pectations, and this control is key for learnability in the "batch" setting. We now show that Sequential 
Rademacher complexity upper-bounds the value of the game, suggesting its importance for online learning 
(see Section 7 for a lower bound).

Theorem 2. The minimax value of a randomized game is bounded as

\[ \mathcal{V}_T(\mathcal{F}) \le 2 \mathfrak{R}_T(\mathcal{F}). \]

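Proof. From Eq. (3),

\begin{align*}
\mathcal{V}_T(\mathcal{F}) &= \sup_{p_1} \mathbb{E}_{x_1 \sim p_1} \cdots \sup_{p_T} \mathbb{E}_{x_T \sim p_T} \left[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right] \tag{4} \\
&= \sup_{p_1} \mathbb{E}_{x_1 \sim p_1} \cdots \sup_{p_T} \mathbb{E}_{x_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \sum_{t=1}^{T} f(x_t) \right\} \right] \\
&\le \sup_{p_1} \mathbb{E}_{x_1 \sim p_1} \cdots \sup_{p_T} \mathbb{E}_{x_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T} \mathbb{E}_{x_t \sim p_t} [f(x_t)] - \sum_{t=1}^{T} f(x_t) \right\} \right] \tag{5}
\end{align*}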

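The last step, in fact, is the first time we deviated from keeping equalities. The upper bound is obtained by
replacing each infimum by a particular choice $f$. Now renaming variables we have

\begin{align*}
\mathcal{V}_T(\mathcal{F}) &\le \sup_{p_1} \mathbb{E}_{x_1 \sim p_1} \cdots \sup_{p_T} \mathbb{E}_{x_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T} \mathbb{E}_{x'_t \sim p_t} [f(x'_t)] - \sum_{t=1}^{T} f(x_t) \right\} \right] \\
&\le \sup_{p_1} \mathbb{E}_{x_1 \sim p_1} \cdots \sup_{p_T} \mathbb{E}_{x_T \sim p_T} \left[ \mathbb{E}_{x'_1 \sim p_1} \cdots \mathbb{E}_{x'_T \sim p_T} \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T} f(x'_t) - \sum_{t=1}^{T} f(x_t) \right\} \right] \\
&\le \sup_{p_1} \mathbb{E}_{x_1, x'_1 \sim p_1} \cdots \sup_{p_T} \mathbb{E}_{x_T, x'_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T} f(x'_t) - \sum_{t=1}^{T} f(x_t) \right\} \right]
\end{align*}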

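where the last two steps are using Jensen's inequality for the supremum.
By the Key Technical Lemma (see Lemma 3 below) with $\phi(u) = u$,

\[ \sup_{p_1} \mathbb{E}_{x_1, x'_1 \sim p_1} \cdots \sup_{p_T} \mathbb{E}_{x_T, x'_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \left( f(x'_t) - f(x_t) \right) \right] \le \sup_{x_1, x'_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t \left( f(x'_t) - f(x_t) \right) \right]. \]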
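Thus,

\begin{align*}
\mathcal{V}_T(\mathcal{F}) &\le \sup_{x_1, x'_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t \left( f(x'_t) - f(x_t) \right) \right] \\
&\le \sup_{x_1, x'_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T} \epsilon_t f(x'_t) \right\} + \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T} -\epsilon_t f(x_t) \right\} \right] \\
&= 2 \sup_{x_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(x_t) \right].
\end{align*}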

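Now, we need to move the suprema over the $x_t$'s outside. This is achieved via an idea similar to skolemization
in logic. We basically exploit the identity

\[ \mathbb{E}_{\epsilon_{1:t-1}} \left[ \sup_{x_t} G(\epsilon_{1:t-1}, x_t) \right] = \sup_{\mathbf{x}_t} \mathbb{E}_{\epsilon_{1:t-1}} \left[ G(\epsilon_{1:t-1}, \mathbf{x}_t(\epsilon_{1:t-1})) \right] \]

that holds for any $G : \{\pm 1\}^{t-1} \times \mathcal{X} \to \mathbb{R}$. On the right the supremum is over functions $\mathbf{x}_t : \{\pm 1\}^{t-1} \to \mathcal{X}$.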

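Using this identity once, we get

\[ \mathcal{V}_T(\mathcal{F}) \le 2 \sup_{x_1, \mathbf{x}_2} \mathbb{E}_{\epsilon_1, \epsilon_2} \left\{ \sup_{x_3} \mathbb{E}_{\epsilon_3} \cdots \sup_{x_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \epsilon_1 f(x_1) + \epsilon_2 f(\mathbf{x}_2(\epsilon_1)) + \sum_{t=3}^{T} \epsilon_t f(x_t) \right\} \right] \right\}. \]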

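Now, use the identity $T - 2$ more times to successively move the supremums over $x_3, \ldots, x_T$ outside, to get

\begin{align*}
\mathcal{V}_T(\mathcal{F}) &\le 2 \sup_{x_1, \mathbf{x}_2, \ldots, \mathbf{x}_T} \mathbb{E}_{\epsilon_1, \ldots, \epsilon_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \epsilon_1 f(x_1) + \sum_{t=2}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon_{1:t-1})) \right\} \right] \\
&= 2 \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_1, \ldots, \epsilon_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right]
\end{align*}

where the last supremum is over $\mathcal{X}$-valued trees of depth $T$. Thus we have proved the required statement. □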
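The exchange of expectation and supremum used in the proof (letting the point $x_t$ depend on the preceding signs) can be checked by brute force on a small finite example. This is only a numerical sanity check under assumed toy data: the payoff `G` and the set `X` below are hypothetical, and the supremum is a maximum because `X` is finite.

```python
from itertools import product


def expectation_of_sup(G, t, X):
    """E_eps[ max_x G(eps, x) ], with eps uniform on {-1, +1}^(t-1)."""
    prefixes = list(product((-1, 1), repeat=t - 1))
    return sum(max(G(e, x) for x in X) for e in prefixes) / len(prefixes)


def sup_over_mappings(G, t, X):
    """max over mappings x_t : {-1, +1}^(t-1) -> X of E_eps[ G(eps, x_t(eps)) ]."""
    prefixes = list(product((-1, 1), repeat=t - 1))
    best = float("-inf")
    for choice in product(X, repeat=len(prefixes)):  # one point of X per prefix
        value = sum(G(e, x) for e, x in zip(prefixes, choice)) / len(prefixes)
        best = max(best, value)
    return best


G = lambda eps, x: sum(eps) * x - x * x / 2  # arbitrary illustrative payoff
X = [-1, 0, 1]
left = expectation_of_sup(G, 3, X)
right = sup_over_mappings(G, 3, X)
```

Both sides agree because the best mapping simply picks, for each sign prefix, the point that maximizes `G` for that prefix.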



Theorem 2 relies on the following technical lemma, which will be used again in Section 6. Its proof requires
considerably more work than the classical symmetrization proof [131 127] due to the non-i.i.d. nature of the
sequences.

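Lemma 3 (Key Technical Lemma). Let $(x_1, \ldots, x_T) \in \mathcal{X}^T$ be a sequence distributed according to $D$ and let
$(x'_1, \ldots, x'_T) \in \mathcal{X}^T$ be a tangent sequence. Let $\phi : \mathbb{R} \to \mathbb{R}$ be a measurable function. Then

\[ \sup_{p_1} \mathbb{E}_{x_1, x'_1 \sim p_1} \cdots \sup_{p_T} \mathbb{E}_{x_T, x'_T \sim p_T} \left[ \phi \left( \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \Delta_f(x_t, x'_t) \right) \right] \le \sup_{x_1, x'_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left[ \phi \left( \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t \Delta_f(x_t, x'_t) \right) \right] \]

where $\epsilon_1, \ldots, \epsilon_T$ are independent (of each other and everything else) Rademacher random variables and
$\Delta_f(x_t, x'_t) = f(x'_t) - f(x_t)$. The inequality also holds when an absolute value of the sum is introduced on
both sides.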



Before proceeding, let us give some intuition behind the attained bounds. Theorem 1 establishes an upper
bound on the value of the game in terms of a stochastic process on $\mathcal{X}$. In general, it is difficult to get a
handle on the behavior of this process. The key idea is to relate this process to a symmetrized version, and
then pass to a new process obtained by fixing a binary tree $\mathbf{x}$ and then following a path in $\mathbf{x}$ using i.i.d.
coin flips. In some sense, we are replacing the $\sigma$-algebra generated by the random process of Theorem 1
by a simpler process generated by Rademacher random variables and a tree $\mathbf{x}$. It can be shown that the
original process and the simpler process are in fact close in a certain sense, yet the process generated by 
the Rademacher random variables is much easier to work with. It is precisely due to symmetrization that 
the trees we consider in this paper are binary trees and not full game trees. Passing to binary trees allows 
us to define covering numbers, combinatorial parameters, and other analogues of the classical notions from 
statistical learning theory. 

5 Covering Numbers and Combinatorial Parameters 

In statistical learning theory, learnability for binary classes of functions is characterized by the Vapnik-
Chervonenkis combinatorial dimension [42]. For real-valued function classes, the corresponding notions are
the scale-sensitive dimensions, such as $P_\gamma$ [SI IS]. For online learning, the notion characterizing learnability
for binary prediction in the realizable case has been introduced by Littlestone [IS] and extended to the
non-realizable case of binary prediction by Shai Ben-David, David Pal and Shai Shalev-Shwartz [8]. Next,
we define the Littlestone's dimension [25, 8] and propose its scale-sensitive versions for real-valued function
classes. In the sequel, these combinatorial parameters are shown to control the growth of covering numbers
on trees. In the setting of prediction, the combinatorial parameters are shown to exactly characterize
learnability (see Section 7).

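Definition 6. An $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $d$ is shattered by a function class $\mathcal{F} \subseteq \{\pm 1\}^{\mathcal{X}}$ if for all $\epsilon \in \{\pm 1\}^d$,
there exists $f \in \mathcal{F}$ such that $f(\mathbf{x}_t(\epsilon)) = \epsilon_t$ for all $t \in [d]$. The Littlestone dimension $\mathrm{Ldim}(\mathcal{F}, \mathcal{X})$ is the largest
$d$ such that $\mathcal{F}$ shatters an $\mathcal{X}$-valued tree of depth $d$.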
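Whether a given tree is shattered by a finite binary class can be checked by enumerating paths. The following sketch assumes a tree stored as a list of dictionaries keyed by sign prefixes and uses a hypothetical class of threshold functions on $\{1, 2, 3\}$; it checks one tree and does not search over trees to compute the Littlestone dimension.

```python
from itertools import product


def shatters(functions, tree):
    """True if for every sign path eps some f has f(x_t(eps)) == eps_t for all t."""
    d = len(tree)
    for eps in product((-1, 1), repeat=d):
        xs = [tree[t][eps[:t]] for t in range(d)]
        if not any(all(f(x) == e for x, e in zip(xs, eps)) for f in functions):
            return False
    return True


# Hypothetical class: thresholds f_th(x) = +1 if x >= th else -1.
thresholds = [lambda x, th=th: 1 if x >= th else -1 for th in (1, 2, 3, 4)]

# Binary-search style tree: ask about 2 first, then 3 or 1 depending on the sign.
search_tree = [{(): 2}, {(-1,): 3, (1,): 1}]
# Constant tree: ask about 2 twice (a fixed set of points, as in the batch setting).
constant_tree = [{(): 2}, {(-1,): 2, (1,): 2}]

search_shattered = shatters(thresholds, search_tree)
constant_shattered = shatters(thresholds, constant_tree)
```

The adaptive tree of depth 2 is shattered while the constant one is not, which reflects the extra power the adversary gains from choosing points sequentially.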

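Definition 7. An $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $d$ is $\alpha$-shattered by a function class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$, if there exists an
$\mathbb{R}$-valued tree $\mathbf{s}$ of depth $d$ such that

\[ \forall \epsilon \in \{\pm 1\}^d, \ \exists f \in \mathcal{F} \ \text{s.t.} \ \forall t \in [d], \ \epsilon_t \left( f(\mathbf{x}_t(\epsilon)) - \mathbf{s}_t(\epsilon) \right) \ge \alpha/2. \]

The tree $\mathbf{s}$ is called the witness to shattering. The fat-shattering dimension $\mathrm{fat}_\alpha(\mathcal{F}, \mathcal{X})$ at scale $\alpha$ is the
largest $d$ such that $\mathcal{F}$ $\alpha$-shatters an $\mathcal{X}$-valued tree of depth $d$.

With these definitions it is easy to see that $\mathrm{fat}_\alpha(\mathcal{F}, \mathcal{X}) = \mathrm{Ldim}(\mathcal{F}, \mathcal{X})$ for a binary-valued function class
$\mathcal{F} \subseteq \{\pm 1\}^{\mathcal{X}}$ for any $0 < \alpha \le 2$.

When $\mathcal{X}$ and/or $\mathcal{F}$ is understood from the context, we will simply write $\mathrm{fat}_\alpha$ or $\mathrm{fat}_\alpha(\mathcal{F})$ instead of $\mathrm{fat}_\alpha(\mathcal{F}, \mathcal{X})$.
Furthermore, we will write $\mathrm{fat}_\alpha(\mathcal{F}, \mathbf{x})$ for $\mathrm{fat}_\alpha(\mathcal{F}, \mathrm{Img}(\mathbf{x}))$. In other words, $\mathrm{fat}_\alpha(\mathcal{F}, \mathbf{x})$ is the largest $d$ such
that $\mathcal{F}$ $\alpha$-shatters a tree $\mathbf{z}$ of depth $d$ with $\mathrm{Img}(\mathbf{z}) \subseteq \mathrm{Img}(\mathbf{x})$.

Let us mention that if trees $\mathbf{x}$ are defined by constant mappings $\mathbf{x}_t(\epsilon) = x_t$, the combinatorial parameters
coincide with the Vapnik-Chervonenkis dimension and with the scale-sensitive dimension $P_\gamma$. Therefore, the
notions we are studying are strict "temporal" generalizations of the VC theory.

As in statistical learning theory, the combinatorial parameters are only useful if they can be shown to capture
that aspect of $\mathcal{F}$ which is important for learnability. In particular, a "size" of a function class is known to
be related to complexity of learning from i.i.d. data, and the classical way to measure "size" is through a
cover or a packing set. We propose the following definitions for online learning.

Definition 8. A set $V$ of $\mathbb{R}$-valued trees of depth $T$ is an $\alpha$-cover (with respect to $\ell_p$-norm) of $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ on
a tree $\mathbf{x}$ of depth $T$ if

\[ \forall f \in \mathcal{F}, \ \forall \epsilon \in \{\pm 1\}^T \ \exists \mathbf{v} \in V \ \text{s.t.} \ \left( \frac{1}{T} \sum_{t=1}^{T} |\mathbf{v}_t(\epsilon) - f(\mathbf{x}_t(\epsilon))|^p \right)^{1/p} \le \alpha. \]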
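The cover condition of Definition 8 lets the covering tree depend on both the function and the path, which can make covers far smaller than the set of projected trees. The sketch below checks the condition by brute force, assuming the normalized $\ell_p$ distance along a path; the tree with distinct node labels and the class of single-leaf indicator functions are small illustrative choices, and all helper names are hypothetical.

```python
from itertools import product


def path_values(tree, eps):
    """Values of a tree along the path eps; level t is keyed by the first t signs."""
    return [level[eps[:t]] for t, level in enumerate(tree)]


def is_cover(cover, functions, tree, alpha, p=2):
    """Check the alpha-cover condition; the covering tree may depend on f and the path."""
    T = len(tree)
    for f in functions:
        for eps in product((-1, 1), repeat=T):
            target = [f(x) for x in path_values(tree, eps)]
            dists = (
                (sum(abs(a - b) ** p for a, b in zip(path_values(v, eps), target)) / T) ** (1 / p)
                for v in cover
            )
            if min(dists) > alpha:
                return False
    return True


T = 4
tree = [{q: q for q in product((-1, 1), repeat=t)} for t in range(T)]  # distinct labels
leaves = list(tree[-1].values())
functions = [lambda x, leaf=leaf: int(x == leaf) for leaf in leaves]    # one per leaf

zeros = [{q: 0 for q in level} for level in tree]
ones_on_leaves = [{q: int(t == T - 1) for q in level} for t, level in enumerate(tree)]

# Number of distinct projected trees f(x), for comparison with the cover size.
projection_size = len({tuple(f(x) for level in tree for x in level.values())
                       for f in functions})
```

Two trees suffice to match every function exactly along every path, although the class projects to $2^{T-1} = 8$ distinct trees; the all-zero tree alone is only a cover at scale $1/2$ in the normalized $\ell_2$ distance.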



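The covering number of a function class $\mathcal{F}$ on a given tree $\mathbf{x}$ is defined as

\[ \mathcal{N}_p(\alpha, \mathcal{F}, \mathbf{x}) = \min \{ |V| : V \text{ is an } \alpha\text{-cover w.r.t. } \ell_p\text{-norm of } \mathcal{F} \text{ on } \mathbf{x} \}. \]

Further define $\mathcal{N}_p(\alpha, \mathcal{F}, T) = \sup_{\mathbf{x}} \mathcal{N}_p(\alpha, \mathcal{F}, \mathbf{x})$, the maximal $\ell_p$ covering number of $\mathcal{F}$ over depth $T$ trees.
In particular, a set $V$ of $\mathbb{R}$-valued trees of depth $T$ is a 0-cover of $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ on a tree $\mathbf{x}$ of depth $T$ if

\[ \forall f \in \mathcal{F}, \ \forall \epsilon \in \{\pm 1\}^T \ \exists \mathbf{v} \in V \ \text{s.t.} \ \mathbf{v}_t(\epsilon) = f(\mathbf{x}_t(\epsilon)) \ \text{for all } t \in [T]. \]

We denote by $\mathcal{N}(0, \mathcal{F}, \mathbf{x})$ the size of a smallest 0-cover on $\mathbf{x}$ and $\mathcal{N}(0, \mathcal{F}, T) = \sup_{\mathbf{x}} \mathcal{N}(0, \mathcal{F}, \mathbf{x})$.

Let us discuss a subtle point. The size of a 0-cover should not be mistaken for the size $|\mathcal{F}(\mathbf{x})|$ of the projection of $\mathcal{F}$
onto the tree $\mathbf{x}$, and the same care should be taken when dealing with $\alpha$-covers. Let us illustrate this with
an example. Consider a tree $\mathbf{x}$ of depth $T$ and suppose for simplicity that $|\mathrm{Img}(\mathbf{x})| = 2^T - 1$, i.e. the values
of $\mathbf{x}$ are all distinct. Suppose $\mathcal{F}$ consists of $2^{T-1}$ binary-valued functions defined as zero on all of $\mathrm{Img}(\mathbf{x})$
except for a single value of $\mathrm{Img}(\mathbf{x}_T)$. In plain words, each function is zero everywhere on the tree except for
a single leaf. While the projection $\mathcal{F}(\mathbf{x})$ has $2^{T-1}$ distinct trees, the size of a 0-cover is only 2. It is enough
to take an all-zero function $g_0$ along with a function $g_1$ which is zero on all of $\mathrm{Img}(\mathbf{x})$ except $\mathrm{Img}(\mathbf{x}_T)$ (i.e.
on the leaves). It is easy to verify that $g_0(\mathbf{x})$ and $g_1(\mathbf{x})$ provide a 0-cover for $\mathcal{F}$ on $\mathbf{x}$, and therefore, unlike
$|\mathcal{F}(\mathbf{x})|$, the size of the cover does not grow with $T$. The example is encouraging: our definition of a cover
captures the fact that the function class is "simple" for any given path.

Next, we naturally propose a definition of a packing.

Definition 9. A set $V$ of $\mathbb{R}$-valued trees of depth $T$ is said to be $\alpha$-separated if

\[ \forall \mathbf{v} \in V, \ \exists \epsilon \in \{\pm 1\}^T \ \text{s.t.} \ \forall \mathbf{w} \in V \setminus \{\mathbf{v}\}, \ \left( \frac{1}{T} \sum_{t=1}^{T} |\mathbf{v}_t(\epsilon) - \mathbf{w}_t(\epsilon)|^p \right)^{1/p} > \alpha. \]

The packing number $\mathcal{D}_p(\alpha, \mathcal{F}, \mathbf{x})$ of a function class $\mathcal{F}$ on a given tree $\mathbf{x}$ is the size of the largest $\alpha$-separated
subset of $\{f(\mathbf{x}) : f \in \mathcal{F}\}$.

Definition 10. A set $V$ of $\mathbb{R}$-valued trees of depth $T$ is said to be strongly $\alpha$-separated if

\[ \exists \epsilon \in \{\pm 1\}^T \ \text{s.t.} \ \forall \mathbf{v}, \mathbf{w} \in V, \ \mathbf{v} \ne \mathbf{w}, \ \left( \frac{1}{T} \sum_{t=1}^{T} |\mathbf{v}_t(\epsilon) - \mathbf{w}_t(\epsilon)|^p \right)^{1/p} > \alpha. \]

The strong packing number $\mathcal{M}_p(\alpha, \mathcal{F}, \mathbf{x})$ of a function class $\mathcal{F}$ on a given tree $\mathbf{x}$ is the size of the largest
strongly $\alpha$-separated subset of $\{f(\mathbf{x}) : f \in \mathcal{F}\}$.

Note the distinction between the packing number and the strong packing number. For the former, it must
be that every member of the packing is $\alpha$-separated from every other member on some path. For the latter,
there must be a path on which every member of the packing is $\alpha$-separated from every other member.
This distinction does not arise in the classical scenario of "batch" learning. We observe that if a tree $\mathbf{x}$
is defined by constant mappings $\mathbf{x}_t = x_t$, the two notions of packing and strong packing coincide, i.e.
$\mathcal{D}_p(\alpha, \mathcal{F}, \mathbf{x}) = \mathcal{M}_p(\alpha, \mathcal{F}, \mathbf{x})$. The following lemma gives a relationship between covering numbers and the
two notions of packing numbers. The form of this should be familiar, except for the distinction between the
two types of packing numbers.

Lemma 4. For any $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$, any $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $T$, and any $\alpha > 0$,

\[ \mathcal{M}_p(2\alpha, \mathcal{F}, \mathbf{x}) \le \mathcal{N}_p(\alpha, \mathcal{F}, \mathbf{x}) \le \mathcal{D}_p(\alpha, \mathcal{F}, \mathbf{x}). \]

It is important to note that the gap between the two types of packing can be as much as $2^T$.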

5.1 A Combinatorial Upper Bound



We now relate the combinatorial parameters introduced in the previous section to the size of a cover. In the
binary case ($k = 1$ below), a reader might notice a similarity of Theorems 5 and 7 to the classical results due
to Sauer [301, Shelah (33] (also, Perles and Shelah), and Vapnik and Chervonenkis [1^. There are several
approaches to proving what is often called the Sauer-Shelah lemma. We opt for the inductive-style proof
(e.g. Alon and Spencer [T). Dealing with trees, however, requires more work than in the VC case.

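Theorem 5. Let $\mathcal{F} \subseteq \{0, \ldots, k\}^{\mathcal{X}}$ be a class of functions with $\mathrm{fat}_2(\mathcal{F}) = d$. Then

\[ \mathcal{N}_\infty(1/2, \mathcal{F}, T) \le \sum_{i=0}^{d} \binom{T}{i} k^i \le (ekT)^d. \]

Furthermore, for $T \ge d$

\[ \sum_{i=0}^{d} \binom{T}{i} k^i \le \left( \frac{ekT}{d} \right)^d. \]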



Armed with Theorem 5, we can approach the problem of bounding the size of a cover at an $\alpha$ scale by a
discretization trick. For the classical case of a cover based on a set of points, the discretization idea appears in
[3, 28]. When passing from the combinatorial result to the cover at scale $\alpha$ in Corollary 6, it is crucial that
Theorem 5 is in terms of $\mathrm{fat}_2(\mathcal{F})$ and not $\mathrm{fat}_1(\mathcal{F})$. This point can be seen in the proof of Corollary 6 (also
see [ISj): the discretization process can assign almost identical function values to discrete values which differ
by 1. This explains why the combinatorial result of Theorem 5 is proved for the 2-shattering dimension.

We now show that the covering numbers are bounded in terms of the fat-shattering dimension. 

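Corollary 6. Suppose $\mathcal{F}$ is a class of $[-1, 1]$-valued functions on $\mathcal{X}$. Then for any $\alpha > 0$, any $T > 0$, and
any $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $T$,

\[ \mathcal{N}_1(\alpha, \mathcal{F}, \mathbf{x}) \le \mathcal{N}_2(\alpha, \mathcal{F}, \mathbf{x}) \le \mathcal{N}_\infty(\alpha, \mathcal{F}, \mathbf{x}) \le \left( \frac{2eT}{\alpha} \right)^{\mathrm{fat}_\alpha(\mathcal{F})}. \]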

With a proof similar to Theorem 5, a bound on the 0-cover can be proved in terms of the $\mathrm{fat}_1(\mathcal{F})$ combinatorial
parameter. Of particular interest is the case $k = 1$, when $\mathrm{fat}_1(\mathcal{F}) = \mathrm{Ldim}(\mathcal{F})$.

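Theorem 7. Let $\mathcal{F} \subseteq \{0, \ldots, k\}^{\mathcal{X}}$ be a class of functions with $\mathrm{fat}_1(\mathcal{F}) = d$. Then

\[ \mathcal{N}(0, \mathcal{F}, T) \le \sum_{i=0}^{d} \binom{T}{i} k^i \le (ekT)^d. \]

Furthermore, for $T \ge d$

\[ \sum_{i=0}^{d} \binom{T}{i} k^i \le \left( \frac{ekT}{d} \right)^d. \]

In particular, the result holds for binary-valued function classes ($k = 1$), in which case $\mathrm{fat}_1(\mathcal{F}) = \mathrm{Ldim}(\mathcal{F})$.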
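The combinatorial sum $\sum_{i=0}^{d} \binom{T}{i} k^i$ appearing in Theorems 5 and 7 and its two closed-form upper bounds are easy to compare numerically. The sketch below is a sanity check over a small hypothetical parameter grid, not a proof; the function name `cover_bound` is introduced here for illustration.

```python
from math import comb, e


def cover_bound(T, d, k):
    """sum_{i=0}^{d} C(T, i) * k^i, the combinatorial upper bound on the cover size."""
    return sum(comb(T, i) * k ** i for i in range(d + 1))


# Compare the sum with (e k T / d)^d and (e k T)^d on a small grid with T >= d >= 1.
checks = [
    cover_bound(T, d, k) <= (e * k * T / d) ** d <= (e * k * T) ** d
    for d in range(1, 6)
    for T in range(d, 13)
    for k in range(1, 4)
]
binary_example = cover_bound(10, 3, 1)  # 1 + 10 + 45 + 120
```

For $T = 10$, $d = 3$, $k = 1$ the sum is 176, which grows polynomially in $T$ for fixed $d$, in contrast with the $2^T$ paths of the tree.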



When bounding deviations of means from expectations uniformly over the function class, the usual approach 
proceeds by a symmetrization argument [14j followed by passing to a cover of the function class and a union 
bound (e.g. [27)). Alternatively, a more refined chaining analysis integrates over covering at different scales 
(e.g. |39p. By following the same path, we are able to prove a number of similar results for our setting. 
In the next section we present a bound similar to Massart's finite class lemma [261 Lemma 5.2], and in the 
following section this result will be used when integrating over different scales for the cover. 

5.2 Finite Class Lemma and the Chaining Method



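Lemma 8. For any finite set $V$ of $\mathbb{R}$-valued trees of depth $T$ we have that

\[ \mathbb{E}_{\epsilon} \left[ \max_{\mathbf{v} \in V} \sum_{t=1}^{T} \epsilon_t \mathbf{v}_t(\epsilon) \right] \le \sqrt{ 2 \log(|V|) \max_{\mathbf{v} \in V} \max_{\epsilon \in \{\pm 1\}^T} \sum_{t=1}^{T} \mathbf{v}_t(\epsilon)^2 }. \]

A simple consequence of the above lemma is that if $\mathcal{F} \subseteq [0, 1]^{\mathcal{X}}$ is a finite class, then for any given tree $\mathbf{x}$ we
have that

\[ \mathbb{E}_{\epsilon} \left[ \max_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right] \le \sqrt{2 T \log(|\mathcal{F}|)}. \]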



Note that if $f \in \mathcal{F}$ is associated with an "expert", this result combined with Theorem 2 yields a bound
given by the exponential weighted average forecaster algorithm (see [H]). In Section 9 we discuss this case
in more detail. However, as we show next, Lemma 8 goes well beyond just finite classes and can be used to
get an analog of Dudley entropy bound [\2\ for the online setting through a chaining argument.
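The finite-class consequence of Lemma 8, namely that the expected maximum is at most $\sqrt{2T \log |\mathcal{F}|}$ for a $[0,1]$-valued class, can be checked by exact enumeration on a small instance. The sketch below uses a hypothetical tree with distinct node labels and a hypothetical class of four random lookup-table functions; the seed and sizes are arbitrary choices.

```python
from itertools import product
from math import log, sqrt
import random


def expected_max(tree, functions):
    """E_eps[ max_f sum_t eps_t f(x_t(eps)) ] by enumerating all 2^T sign paths."""
    T = len(tree)
    total = 0.0
    for eps in product((-1, 1), repeat=T):
        xs = [tree[t][eps[:t]] for t in range(T)]
        total += max(sum(e * f[x] for e, x in zip(eps, xs)) for f in functions)
    return total / 2 ** T


T = 6
# Hypothetical tree whose node labels are the sign prefixes themselves (all distinct).
tree = [{q: q for q in product((-1, 1), repeat=t)} for t in range(T)]
nodes = [x for level in tree for x in level.values()]

rng = random.Random(0)
# Hypothetical finite class: four random [0, 1]-valued functions stored as tables.
functions = [{x: rng.random() for x in nodes} for _ in range(4)]

value = expected_max(tree, functions)
bound = sqrt(2 * T * log(len(functions)))
```

With $T = 6$ and four functions the bound is about 4.08, which is below the trivial bound $T = 6$, so the comparison is not vacuous.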

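Definition 11. The Integrated complexity of a function class $\mathcal{F} \subseteq [-1, 1]^{\mathcal{X}}$ is defined as

\[ \mathfrak{D}_T(\mathcal{F}) = \inf_{\alpha} \left\{ 4 T \alpha + 12 \int_{\alpha}^{1} \sqrt{T \log \mathcal{N}_2(\delta, \mathcal{F}, T)} \, d\delta \right\}. \]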
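The integrated complexity trades a linear term $4T\alpha$ against an entropy integral over scales above $\alpha$. The sketch below approximates the infimum on a grid with the midpoint rule, assuming the form $\inf_\alpha \{4T\alpha + 12 \int_\alpha^1 \sqrt{T \log \mathcal{N}_2(\delta, \mathcal{F}, T)}\, d\delta\}$. The covering number is not computed; it is replaced by a user-supplied upper bound on its logarithm, here the hypothetical choice $d \log(2eT/\delta)$ for a class whose fat-shattering dimension is at most $d$ at every scale.

```python
from math import e, log, sqrt


def integrated_complexity(T, log_cover, grid=1000):
    """Grid approximation of inf_alpha { 4 T alpha + 12 int_alpha^1 sqrt(T log N(delta)) d delta }.

    `log_cover(delta)` must return a nonnegative bound on the log covering number.
    """
    width = 1.0 / grid
    best = 4.0 * T          # alpha = 1: the integral is empty
    tail = 0.0
    for i in range(grid - 1, -1, -1):
        delta = (i + 0.5) * width                    # midpoint of [i/grid, (i+1)/grid]
        tail += sqrt(T * log_cover(delta)) * width   # integral from i/grid to 1
        best = min(best, 4.0 * T * i * width + 12.0 * tail)
    return best


T, d = 10_000, 3  # hypothetical horizon and dimension bound
estimate = integrated_complexity(T, lambda delta: d * log(2 * e * T / delta))
trivial = integrated_complexity(100, lambda delta: 0.0)  # a single function
```

Because an upper bound on the covering number is plugged in, `estimate` is an upper bound on the grid approximation of the integrated complexity, and it is far smaller than the value $4T$ obtained at $\alpha = 1$.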



To prove the next theorem, we consider covers of the class $\mathcal{F}$ at different scales that form a geometric
progression. We zoom into a given function $f \in \mathcal{F}$ using covering elements at successive scales. This
zooming in procedure is visualized as forming a chain that consists of links connecting elements of covers at
successive scales. The Rademacher complexity of $\mathcal{F}$ can then be bounded by controlling the Rademacher
complexity of the link classes, i.e. the class consisting of differences of functions from covers at neighbouring
scales. This last part of the argument is the place where our proof becomes a bit more involved than the
classical case.

Theorem 9. For any function class $\mathcal{F} \subseteq [-1, 1]^{\mathcal{X}}$,

\[ \mathfrak{R}_T(\mathcal{F}) \le \mathfrak{D}_T(\mathcal{F}). \]

If a fat-shattering dimension of the class can be controlled, Corollary 6 together with Theorem 9 yield an
upper bound on the value.

We can now show that, in fact, the two complexity measures $\mathfrak{R}_T(\mathcal{F})$ and $\mathfrak{D}_T(\mathcal{F})$ are equivalent, up to a
logarithmic factor. Before stating this result formally, we prove the following lemma which asserts that the
fat-shattering dimensions at "large enough" scales cannot be too large.

Lemma 10. For any $\beta > \frac{2}{T} \mathfrak{R}_T(\mathcal{F})$, we have that $\mathrm{fat}_\beta(\mathcal{F}) < T$.

The following lemma complements Theorem 9.

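Lemma 11. For any function class $\mathcal{F} \subseteq [-1, 1]^{\mathcal{X}}$, we have that

\[ \mathfrak{D}_T(\mathcal{F}) \le 8 \, \mathfrak{R}_T(\mathcal{F}) \left( 1 + 4 \sqrt{2} \log^{3/2}(e T^2) \right) \]

as long as $\mathfrak{R}_T(\mathcal{F}) \ge 1$.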

6 Uniform Convergence



In the statistical setting it can be shown that learnability of a supervised learning problem is equivalent to
the so-called uniform Glivenko-Cantelli property of the class, which says that empirical averages converge
to the expected value of the function for any fixed distribution (samples drawn i.i.d.), uniformly over the
function class and almost surely. We define below an analogous property for dependent distributions, which
requires that, uniformly over the function class, the average value of the function converges to the average
conditional expectation of the function values almost surely.

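Definition 12. A function class $\mathcal{F}$ satisfies Universal Uniform Convergence if for all $\alpha > 0$,

\[ \lim_{n \to \infty} \sup_{D} \mathbb{P}_D \left( \sup_{T \ge n} \sup_{f \in \mathcal{F}} \frac{1}{T} \left| \sum_{t=1}^{T} \left( f(x_t) - \mathbb{E}_{t-1}[f(x_t)] \right) \right| > \alpha \right) = 0 \]

where the supremum is over distributions $D$ over infinite sequences $(x_1, \ldots, x_T, \ldots)$.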

We remark that the notion of uniform Glivenko-Cantelli classes is recovered if the supremum is taken over 
i.i.d. distributions. The theorem below shows that finite fat shattering dimension at all scales is a sufficient 
condition for Universal Uniform Convergence. 

Theorem 12. Let $\mathcal{F}$ be a class of $[-1, 1]$-valued functions. If $\mathrm{fat}_\alpha(\mathcal{F})$ is finite for all $\alpha > 0$, then $\mathcal{F}$ satisfies
Universal Uniform Convergence.

The proof follows from Lemma 13 and Lemma 14 below, while Lemma 15 is an even stronger version of
Lemma 14. We remark that Lemma 13 is the "in-probability" version of the sequential symmetrization technique
of Theorem 2 and Lemma 15 is the "in-probability" version of Theorem 9.

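Lemma 13. Let $\mathcal{F}$ be a class of $[-1, 1]$-valued functions. Then for any $\alpha > 0$,

\[ \mathbb{P}_D \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^{T} \left( f(x_t) - \mathbb{E}_{t-1}[f(x_t)] \right) \right| > \alpha \right) \le 4 \sup_{\mathbf{x}} \mathbb{P}_{\epsilon} \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right| > \alpha/4 \right). \]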



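Lemma 14. Let $\mathcal{F}$ be a class of $[-1, 1]$-valued functions. For any $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $T$ and $\alpha > 0$,

\[ \mathbb{P}_{\epsilon} \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right| > \alpha/4 \right) \le 2 \mathcal{N}_1(\alpha/8, \mathcal{F}, \mathbf{x}) \, e^{-T\alpha^2/128} \le 2 \left( \frac{16 e T}{\alpha} \right)^{\mathrm{fat}_{\alpha/8}} e^{-T\alpha^2/128}. \]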



Next, we show that the sequential Rademacher complexity is, in some sense, the "right" complexity measure 
even when one considers high probability statements. 

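Lemma 15. Let $\mathcal{F}$ be a class of $[-1, 1]$-valued functions and suppose $\mathrm{fat}_\alpha(\mathcal{F})$ is finite for all $\alpha > 0$. Then
for any $\theta > \sqrt{8/T}$, for any $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $T$,

\begin{align*}
& \mathbb{P}_{\epsilon} \left( \sup_{f \in \mathcal{F}} \left| \frac{1}{T} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right| > 128 \left( 1 + \theta \sqrt{T} \log^{3/2}(2T) \right) \cdot \frac{\mathfrak{R}_T(\mathcal{F})}{T} \right) \\
& \quad \le \mathbb{P}_{\epsilon} \left( \sup_{f \in \mathcal{F}} \left| \frac{1}{T} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right| > \inf_{\alpha > 0} \left\{ 4\alpha + 12 \theta \int_{\alpha}^{1} \sqrt{\log \mathcal{N}_\infty(\delta, \mathcal{F}, T)} \, d\delta \right\} \right) \le L e^{-T\theta^2/4}
\end{align*}

where $L$ is a constant such that $L > \sum_{j=1}^{\infty} \mathcal{N}_\infty(2^{-j}, \mathcal{F}, T)^{-1}$.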

While the present paper is concerned with expected versions of minimax regret and the corresponding com- 
plexities, the above lemmas can be employed to give an analogous in-probability treatment. To obtain such 
in-probability statements, the value is defined as the minimax probability of regret exceeding a threshold. 
7 Supervised Learning



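In this section we study the supervised learning problem where the player picks a function $f_t \in \mathbb{R}^{\mathcal{X}}$ at any time $t$, the adversary provides an input-target pair $(x_t, y_t)$, and the player suffers loss $|f_t(x_t) - y_t|$. Note that if $\mathcal{F} \subseteq \{\pm 1\}^{\mathcal{X}}$ and each $y_t \in \{\pm 1\}$ then the problem reduces to binary classification. As we are interested in prediction, we allow $f_t$ to be outside of $\mathcal{F}$.

Though we use the absolute loss in this section, it is easy to see that all the results hold (with modified rates) for any loss $\ell(f(x), y)$ which is such that for all $f$, $x$ and $y$, with $\hat{y} = f(x)$,

$$\phi(\ell(\hat{y}, y)) \le |\hat{y} - y| \le \Phi(\ell(\hat{y}, y)),$$

where $\Phi$ and $\phi$ are monotonically increasing functions. For instance the squared loss is a classic example.

To formally define the value of the online supervised learning game, fix a set of labels $\mathcal{Y} \subseteq [-1,1]$. Given $\mathcal{F}$, define the associated loss class,

$$\mathcal{F}_S = \{ (x, y) \mapsto |f(x) - y| : f \in \mathcal{F} \}.$$

Now, the supervised game is obtained using the pair $(\mathcal{F}_S, \mathcal{X} \times \mathcal{Y})$ and we accordingly define

$$\mathcal{V}_T^S(\mathcal{F}) = \mathcal{V}_T(\mathcal{F}_S, \mathcal{X} \times \mathcal{Y}).$$

Binary classification is, of course, a special case when $\mathcal{Y} = \{\pm 1\}$ and $\mathcal{F} \subseteq \{\pm 1\}^{\mathcal{X}}$. In that case, we simply use $\mathcal{V}_T^{\mathrm{Binary}}$ for $\mathcal{V}_T^S$.

Proposition 16. For the supervised learning game played with a function class $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$, for any $T \ge 1$

$$\frac{1}{4\sqrt{2}} \sup_{\alpha} \left\{ \alpha \sqrt{T \min\{\mathrm{fat}_\alpha, T\}} \right\} \le \frac{1}{2} \mathcal{V}_T^S(\mathcal{F}) \le \mathfrak{R}_T(\mathcal{F}) \le \mathfrak{D}_T(\mathcal{F}) \le \inf_{\alpha} \left\{ 4T\alpha + 12\sqrt{T} \int_\alpha^1 \sqrt{\mathrm{fat}_\beta \log(2eT/\beta)}\, d\beta \right\}. \qquad (6)$$

Moreover, the lower bound $\mathfrak{R}_T(\mathcal{F}) \le \mathcal{V}_T^S(\mathcal{F})$ on the value of the supervised game also holds.

The proposition above implies that finiteness of the fat-shattering dimension is necessary and sufficient for learnability of a supervised game. Further, all the complexity notions introduced so far are within a logarithmic factor from each other whenever the problem is learnable. These results are summarized in the next theorem.

Theorem 17. For any function class $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$, the following statements are equivalent

1. Function class $\mathcal{F}$ is online learnable in the supervised setting.

2. For any $\alpha > 0$, $\mathrm{fat}_\alpha(\mathcal{F})$ is finite.

Moreover, if the function class is online learnable, then the value of the supervised game $\mathcal{V}_T^S(\mathcal{F})$, the Sequential Rademacher complexity $\mathfrak{R}_T(\mathcal{F})$, and the Integrated complexity $\mathfrak{D}_T(\mathcal{F})$ are within a multiplicative factor of $O(\log^{3/2} T)$ of each other.

Corollary 18. For the binary classification game played with function class $\mathcal{F}$ we have that

$$K_1 \sqrt{T \min\{\mathrm{Ldim}(\mathcal{F}), T\}} \le \mathcal{V}_T^{\mathrm{Binary}}(\mathcal{F}) \le K_2 \sqrt{T\, \mathrm{Ldim}(\mathcal{F}) \log T}$$

for some universal constants $K_1, K_2$.
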
We wish to point out that the lower bound of Proposition 16 also holds for "improper" supervised learning algorithms, i.e. those that simply output a prediction $\hat{y}_t \in \mathcal{Y}$ rather than a function $f_t \in \mathcal{F}$. Formally, an improper supervised learning strategy $\hat{\pi}$ that learns $\mathcal{F}$ using a class $\mathcal{G} \subseteq \mathcal{Y}^{\mathcal{X}}$ is defined as a sequence of mappings

$$\hat{\pi}_t : (\mathcal{X} \times \mathcal{Y})^{t-1} \to \hat{\mathcal{Q}}, \quad t \in [T],$$

where $\hat{\mathcal{Q}}$ denotes probability distributions over $\mathcal{G}$. We can define the value of the improper supervised learning game as

$$\mathcal{V}_T^S(\mathcal{F}; \mathcal{G}) = \inf_{\hat{q}_1 \in \hat{\mathcal{Q}}} \sup_{x_1, y_1} \mathbb{E}_{g_1 \sim \hat{q}_1} \cdots \inf_{\hat{q}_T \in \hat{\mathcal{Q}}} \sup_{x_T, y_T} \mathbb{E}_{g_T \sim \hat{q}_T} \left[ \sum_{t=1}^{T} |g_t(x_t) - y_t| - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} |f(x_t) - y_t| \right],$$

where $g_t$ has distribution $\hat{q}_t$. Note that $\mathcal{V}_T^S(\mathcal{F}; \mathcal{F}) = \mathcal{V}_T^S(\mathcal{F})$, the latter being the value of the "proper" learning game. We say that a class $\mathcal{F}$ is improperly online learnable in the supervised setting if

$$\limsup_{T \to \infty} \frac{\mathcal{V}_T^S(\mathcal{F}; \mathcal{G})}{T} = 0$$

for some $\mathcal{G}$. Since a proper learning strategy can always be used as an improper learning strategy, we trivially have that if a class is online learnable in the supervised setting then it is improperly online learnable. Because of the above mentioned property of the lower bound of Proposition 16 we also have the non-trivial reverse implication: if a class is improperly online learnable in the supervised setting, it is online learnable.

It is natural to ask whether being able to learn in the online model is different from learning in a batch 
model (in the supervised setting). The standard example (e.g. [2S1 H]) is the class of step functions on 
a bounded interval, which has VC dimension 1, but is not learnable in the online setting. Indeed, it is
possible to verify that its Littlestone dimension is not bounded. Interestingly, the closely-related class of
"ramp" functions (modified step functions with a Lipschitz transition between 0's and 1's) is learnable in
the online setting (and in the batch case). We extend this example as follows. By taking a convex hull of
step-up and step-down functions on a unit interval, we arrive at a class of functions of bounded variation,
which is learnable in the batch model, but not in the online learning model. However, the class of Lipschitz
functions of bounded variation is learnable in both models. Online learnability of the latter class is shown
with techniques analogous to Section 9.



7.1 Generic Algorithm 

We shall now present a generic improper learning algorithm for the supervised setting that achieves a low regret bound whenever the function class is online learnable. For any $\alpha > 0$ define an $\alpha$-discretization of the $[-1,1]$ interval as $B_\alpha = \{-1 + \alpha/2, -1 + 3\alpha/2, \ldots, -1 + (2k+1)\alpha/2, \ldots\}$ for $0 \le k$ and $(2k+1)\alpha \le 4$. Also for any $a \in [-1,1]$ define $\lfloor a \rfloor_\alpha = \mathrm{argmin}_{r \in B_\alpha} |r - a|$. For a set of functions $V \subseteq \mathcal{F}$, any $r \in B_\alpha$ and $x \in \mathcal{X}$ define

$$V(r, x) = \{ f \in V \mid f(x) \in (r - \alpha/2, r + \alpha/2] \}.$$
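
The discretization and the restriction $V(r, x)$ defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: functions are represented as Python callables, the helper names (`discretization`, `round_to_grid`, `restrict`) are hypothetical, and ties in the argmin are broken toward the smaller grid point, a choice the text does not specify.

```python
def discretization(alpha):
    """Grid B_alpha = {-1 + (2k+1)*alpha/2 : k >= 0, (2k+1)*alpha <= 4}."""
    grid = []
    k = 0
    while (2 * k + 1) * alpha <= 4 + 1e-12:  # small slack for float round-off
        grid.append(-1 + (2 * k + 1) * alpha / 2)
        k += 1
    return grid


def round_to_grid(a, grid):
    """Nearest grid point to a; ties go to the smaller point (an assumption)."""
    return min(grid, key=lambda r: (abs(r - a), r))


def restrict(V, r, x, alpha):
    """V(r, x): the functions in V whose value at x lies in (r - alpha/2, r + alpha/2]."""
    return [f for f in V if r - alpha / 2 < f(x) <= r + alpha / 2]


grid = discretization(0.5)           # four cells of width 0.5 covering [-1, 1]
label = round_to_grid(0.3, grid)     # nearest grid point to 0.3
```

Every point of $[-1,1]$ is within $\alpha/2$ of the grid, which is what allows a label to be replaced by its rounded value at a cost of at most $\alpha/2$.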
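
Algorithm 1 Fat-SOA Algorithm ($\mathcal{F}$, $\alpha$)

  $V_1 \leftarrow \mathcal{F}$
  for $t = 1$ to $T$ do
    $R_t(x) = \{ r \in B_\alpha : \mathrm{fat}_\alpha(V_t(r, x)) = \max_{r' \in B_\alpha} \mathrm{fat}_\alpha(V_t(r', x)) \}$
    For each $x \in \mathcal{X}$, let $f_t(x) = \frac{1}{|R_t(x)|} \sum_{r \in R_t(x)} r$
    Play $f_t$ and receive $(x_t, y_t)$
    if $|f_t(x_t) - y_t| \le \alpha$ then
      $V_{t+1} \leftarrow V_t$
    else
      $V_{t+1} \leftarrow V_t(\lfloor y_t \rfloor_\alpha, x_t)$
    end if
  end for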



Lemma 19. Let $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$ be a function class with finite $\mathrm{fat}_\alpha(\mathcal{F})$. Suppose the learner is presented with a sequence $(x_1, y_1), \ldots, (x_T, y_T)$, where $y_t = f(x_t)$ for some fixed $f \in \mathcal{F}$ unknown to the player. Then for the $f_t$'s computed by Algorithm 1 it must hold that

$$\sum_{t=1}^{T} \mathbf{1}\{ |f_t(x_t) - y_t| > \alpha \} \le \mathrm{fat}_\alpha(\mathcal{F}).$$
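
To make the mechanism behind Lemma 19 concrete, the following is a minimal sketch of the binary-valued analogue: Littlestone's Standard Optimal Algorithm, where the Littlestone dimension plays the role of the fat-shattering dimension. It is not the Fat-SOA algorithm itself. The representation of functions as label tuples over a finite domain, the names `ldim` and `soa_mistakes`, and the threshold class used as an example are all assumptions made for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ldim(V):
    """Littlestone dimension of a finite class V (a frozenset of label tuples)."""
    if not V:
        return -1
    n = len(next(iter(V)))
    best = 0
    for x in range(n):
        pos = frozenset(f for f in V if f[x] == 1)
        neg = V - pos
        if pos and neg:  # x splits V, so a deeper shattered tree may exist
            best = max(best, 1 + min(ldim(pos), ldim(neg)))
    return best


def soa_mistakes(F, target, xs):
    """Run SOA on the points xs labelled by target; return the number of mistakes."""
    V = frozenset(F)
    mistakes = 0
    for x in xs:
        sub = {y: frozenset(f for f in V if f[x] == y) for y in (1, -1)}
        pred = max((1, -1), key=lambda y: ldim(sub[y]))  # label keeping the richer subclass
        if pred != target[x]:
            mistakes += 1
            V = sub[target[x]]  # shrink the version space only on mistakes
    return mistakes


# Hypothetical example: 8 threshold functions on the 7 points 0..6.
thresholds = [tuple(1 if x >= th else -1 for x in range(7)) for th in range(8)]
```

Each mistake lowers the dimension of the version space by at least one, so the number of mistakes never exceeds the dimension of the original class; the argument for Fat-SOA follows the same pattern with $\mathrm{fat}_\alpha$ in place of the Littlestone dimension.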



Lemma 19 proves a bound on the performance of Algorithm 1 in the realizable setting. We now provide an algorithm for the agnostic setting. We achieve this by generating "experts" in a way similar to [5]. Using these experts along with the exponentially weighted average (EWA) algorithm we shall provide the generic algorithm for online supervised learning. The EWA algorithm and its regret bound are provided in the appendix for completeness (p. 47).



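Algorithm 2 Expert ($\mathcal{F}$, $\alpha$, $1 \le i_1 < \ldots < i_L \le T$, $Y_1, \ldots, Y_L$)

  $V_1 \leftarrow \mathcal{F}$
  for $t = 1$ to $T$ do
    $R_t(x) = \{ r \in B_\alpha : \mathrm{fat}_\alpha(V_t(r, x)) = \max_{r' \in B_\alpha} \mathrm{fat}_\alpha(V_t(r', x)) \}$
    For each $x \in \mathcal{X}$, let $f'_t(x) = \frac{1}{|R_t(x)|} \sum_{r \in R_t(x)} r$
    if $t \in \{i_1, \ldots, i_L\}$ then
      $\forall x \in \mathcal{X}$, $f_t(x) = Y_j$ where $j$ is s.t. $t = i_j$
      Play $f_t$ and receive $x_t$
      $V_{t+1} = V_t(f_t(x_t), x_t)$
    else
      Play $f_t = f'_t$ and receive $x_t$
      $V_{t+1} = V_t$
    end if
  end for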


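For each $L \le \mathrm{fat}_\alpha(\mathcal{F})$ and every possible choice of $1 \le i_1 < \ldots < i_L \le T$ and $Y_1, \ldots, Y_L \in B_\alpha$ we generate an expert. Denote this set of experts as $E_T$. Each expert outputs a function $f_t \in \mathcal{F}$ at every round $t$. Hence each expert $e \in E_T$ can be seen as a sequence $(e_1, \ldots, e_T)$ of mappings $e_t : \mathcal{X}^{t-1} \to \mathcal{F}$. The total number of unique experts is clearly

$$|E_T| = \sum_{L=0}^{\mathrm{fat}_\alpha} \binom{T}{L} (|B_\alpha| - 1)^L \le \left( \frac{2T}{\alpha} \right)^{\mathrm{fat}_\alpha}.$$

Lemma 20. For any $f \in \mathcal{F}$ there exists an expert $e \in E_T$ such that for any $t \in [T]$,

$$|f(x_t) - e(x_{1:t-1})(x_t)| \le \alpha.$$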

Proof. By Lemma 19, for any function $f \in \mathcal{F}$, the number of rounds on which $|f_t(x_t) - f(x_t)| > \alpha$ for the output $f_t$ of the fat-SOA algorithm is bounded by $\mathrm{fat}_\alpha(\mathcal{F})$. Further, on each such round there are $|B_\alpha| - 1$ other possibilities. For any possible such sequence of "mistakes", there is an expert that predicts the right label on those time steps and on the remaining time agrees with the fat-SOA algorithm for that target function. Hence we see that there is always an expert $e \in E_T$ such that

$$|f(x_t) - e(x_{1:t-1})(x_t)| \le \alpha.$$

□

Theorem 21. For any $\alpha > 0$ if we run the exponentially weighted average (EWA) algorithm with the set $E_T$ of experts then the expected regret of the algorithm is bounded as

$$\mathbb{E}[R_T] \le \alpha T + \sqrt{T\, \mathrm{fat}_\alpha \log\left( \frac{2T}{\alpha} \right)}.$$

Proof. For any $\alpha > 0$ if we run EWA with the corresponding set of experts $E_T$ then we can guarantee that the regret w.r.t. the best expert in the set $E_T$ is bounded by $\sqrt{T\, \mathrm{fat}_\alpha \log\left( \frac{2T}{\alpha} \right)}$. However, by Lemma 20 we have that the regret of the best expert in $E_T$ w.r.t. the best function in function class $\mathcal{F}$ is at most $\alpha T$. Combining, we get the required result. □
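
The expert-aggregation step used in the proof can be illustrated with a minimal sketch of exponentially weighted averaging. The assumptions here are mine rather than the paper's: losses are scaled to $[0,1]$, the learning rate is the textbook choice $\eta = \sqrt{8 \ln N / T}$, and the function name `ewa_expected_loss` is hypothetical. Under these assumptions the standard guarantee is an expected regret of at most $\sqrt{(T/2) \ln N}$ against the best of $N$ experts.

```python
import math


def ewa_expected_loss(expert_losses, eta):
    """Cumulative expected loss of EWA.

    expert_losses[t][i] is the loss in [0, 1] of expert i on round t.
    """
    n = len(expert_losses[0])
    log_w = [0.0] * n  # log-weights, kept in log space for numerical stability
    total = 0.0
    for losses in expert_losses:
        m = max(log_w)
        w = [math.exp(v - m) for v in log_w]
        z = sum(w)
        # expected loss when an expert is drawn with probability w_i / z
        total += sum(wi * li for wi, li in zip(w, losses)) / z
        log_w = [v - eta * li for v, li in zip(log_w, losses)]
    return total


def ewa_regret(expert_losses):
    """Expected regret of EWA against the best single expert, with the textbook rate."""
    T, n = len(expert_losses), len(expert_losses[0])
    eta = math.sqrt(8 * math.log(n) / T)
    best = min(sum(row[i] for row in expert_losses) for i in range(n))
    return ewa_expected_loss(expert_losses, eta) - best
```

With $N = |E_T|$, taking the logarithm of the expert count is what produces the $\mathrm{fat}_\alpha \log(2T/\alpha)$ term under the square root.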

The above theorem holds for a fixed $\alpha$. To provide a regret statement that optimizes over $\alpha$ we consider $\alpha_i$'s of the form $2^{-i}$, assign weights $p_i$ to the experts generated in the above theorem for each $\alpha_i$, and run EWA on the entire set of experts with these initial weights. Hence we get the following corollary.

Corollary 22. Let $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$. The expected regret of the algorithm described above is bounded as

$$\mathbb{E}[R_T] \le \inf_{\alpha} \left\{ \alpha T + \sqrt{T\, \mathrm{fat}_\alpha \log\left( \frac{2T}{\alpha} \right)} + \sqrt{T} \left( 3 + 2 \log\log\left( \frac{1}{\alpha} \right) \right) \right\}.$$

8 Structural Results 

Being able to bound complexity of a function class by a complexity of a simpler class is of great utility for 
proving bounds. In statistical learning theory, such structural results are obtained through properties of 
Rademacher averages [571 [7]. In particular, the contraction inequality due to Ledoux and Talagrand [Ml 
Corollary 3.17], allows one to pass from a composition of a Lipschitz function with a class to the function 
class itself. This wonderful property permits easy convergence proofs for a vast array of problems. 

We show that the notion of Sequential Rademacher complexity also enjoys many of the same properties.
In Section 9, the effectiveness of the results is illustrated on a number of examples. First, we prove the
contraction inequality.

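Lemma 23. Fix a class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{Z}}$ and a function $\phi : \mathbb{R} \times \mathcal{Z} \to \mathbb{R}$. Assume, for all $z \in \mathcal{Z}$, $\phi(\cdot, z)$ is a Lipschitz function with a constant $L$. Then

$$\mathfrak{R}_T(\phi(\mathcal{F})) \le L \cdot \mathfrak{R}_T(\mathcal{F}),$$

where $\phi(\mathcal{F}) = \{ z \mapsto \phi(f(z), z) : f \in \mathcal{F} \}$.

We remark that the lemma above encompasses the case of a Lipschitz $\phi : \mathbb{R} \to \mathbb{R}$, as stated in [^17].
The next lemma bounds the Sequential Rademacher complexity for the product of function classes.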
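
Lemma 24. Let $\mathcal{F} = \mathcal{F}_1 \times \ldots \times \mathcal{F}_k$ where each $\mathcal{F}_j \subseteq \mathbb{R}^{\mathcal{X}}$. Also let $\phi : \mathbb{R}^k \to \mathbb{R}$ be $L$-Lipschitz w.r.t. the $\|\cdot\|_\infty$ norm. Then we have that

$$\mathfrak{R}_T(\phi \circ \mathcal{F}) \le L\, O\left( \log^{3/2}(T) \right) \sum_{j=1}^{k} \mathfrak{R}_T(\mathcal{F}_j).$$

Corollary 25. For a fixed binary function $b : \{\pm 1\}^k \to \{\pm 1\}$ and classes $\mathcal{F}_1, \ldots, \mathcal{F}_k$ of $\{\pm 1\}$-valued functions,

$$\mathfrak{R}_T(b(\mathcal{F}_1, \ldots, \mathcal{F}_k)) \le O\left( \log^{3/2}(T) \right) \sum_{j=1}^{k} \mathfrak{R}_T(\mathcal{F}_j).$$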

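In the next proposition, we summarize some useful properties of Sequential Rademacher complexity (see [27] for the results in the i.i.d. setting).

Proposition 26. Sequential Rademacher complexity satisfies the following properties.

1. If $\mathcal{F} \subseteq \mathcal{G}$, then $\mathfrak{R}_T(\mathcal{F}) \le \mathfrak{R}_T(\mathcal{G})$.

2. $\mathfrak{R}_T(\mathcal{F}) = \mathfrak{R}_T(\mathrm{conv}(\mathcal{F}))$.

3. $\mathfrak{R}_T(c\mathcal{F}) = |c|\, \mathfrak{R}_T(\mathcal{F})$ for all $c \in \mathbb{R}$.

4. If $\phi : \mathbb{R} \to \mathbb{R}$ is $L$-Lipschitz, then $\mathfrak{R}_T(\phi(\mathcal{F})) \le L\, \mathfrak{R}_T(\mathcal{F})$.

5. For any $h$, $\mathfrak{R}_T(\mathcal{F} + h) = \mathfrak{R}_T(\mathcal{F})$ where $\mathcal{F} + h = \{ f + h : f \in \mathcal{F} \}$.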
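
Some of these properties can be checked numerically on small instances. The sketch below evaluates the inner quantity $\mathbb{E}_\epsilon \max_f \sum_t \epsilon_t f(\mathbf{x}_t(\epsilon))$ by brute force for one fixed tree; it does not take the supremum over trees, so it is a per-tree illustration only. The representation (functions as value tuples over a three-point domain, a tree as a callable on the tuple of past signs) and all names are assumptions. For a fixed tree, monotonicity, scaling by a positive constant, and invariance under translation by $h$ hold exactly.

```python
from itertools import product


def seq_rademacher(funcs, tree, T):
    """E_eps max_f sum_t eps_t * f(x_t(eps_{1:t-1})) for a single fixed tree.

    funcs: list of tuples, f[x] is the value of f at domain point x.
    tree:  maps the tuple of past signs (length t-1) to a domain point.
    """
    total = 0.0
    for eps in product((-1, 1), repeat=T):
        xs = [tree(eps[:t]) for t in range(T)]  # points visited along this path
        total += max(sum(e * f[x] for e, x in zip(eps, xs)) for f in funcs)
    return total / 2 ** T


def example_tree(past):
    """Hypothetical tree on the domain {0, 1, 2}: the node is the number of +1 signs so far."""
    return sum(1 for e in past if e == 1)


F = [(0.5, -1.0, 0.2), (-0.3, 0.8, 1.0), (1.0, 0.1, -0.7)]
base = seq_rademacher(F, example_tree, 3)
```

Translation invariance holds because $\mathbf{x}_t$ depends only on $\epsilon_{1:t-1}$, so the term $\sum_t \epsilon_t h(\mathbf{x}_t(\epsilon))$ has zero mean.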



9 Examples and Applications 

9.1 Example: Linear Function Classes 

Suppose $\mathcal{F}_W$ is a class consisting of linear functions $x \mapsto \langle w, x \rangle$ where the weight vector $w$ comes from some set $W$,

$$\mathcal{F}_W = \{ x \mapsto \langle w, x \rangle : w \in W \}.$$

Often, it is possible to find a strongly convex non-negative function $\Psi(w)$ such that $\Psi(w) \le \Psi_{\max} < \infty$ for all $w \in W$. Recall that a function $\Psi : W \to \mathbb{R}$ is $\sigma$-strongly convex on $W$ w.r.t. a norm $\|\cdot\|$ if, for all $\theta \in [0,1]$ and $w_1, w_2 \in W$,

$$\Psi(\theta w_1 + (1 - \theta) w_2) \le \theta \Psi(w_1) + (1 - \theta) \Psi(w_2) - \frac{\sigma\, \theta (1 - \theta)}{2} \|w_1 - w_2\|^2.$$
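
As a quick sanity check of this definition, the sketch below evaluates the strong convexity inequality numerically for $\Psi(w) = \frac{1}{2}\|w\|_2^2$, which is 1-strongly convex w.r.t. the Euclidean norm (the inequality holds with equality). The function names are hypothetical and the choice of $\Psi$ is an illustrative assumption.

```python
def psi(w):
    """Psi(w) = 0.5 * ||w||_2^2."""
    return 0.5 * sum(v * v for v in w)


def strong_convexity_gap(w1, w2, theta, sigma=1.0):
    """Right-hand side minus left-hand side of the strong convexity inequality.

    A non-negative value means the inequality holds for this (w1, w2, theta).
    """
    mix = [theta * a + (1 - theta) * b for a, b in zip(w1, w2)]
    dist2 = sum((a - b) ** 2 for a, b in zip(w1, w2))
    rhs = theta * psi(w1) + (1 - theta) * psi(w2) - sigma * theta * (1 - theta) / 2 * dist2
    return rhs - psi(mix)
```

For this $\Psi$ the gap is zero up to round-off when $\sigma = 1$ and becomes negative for any $\sigma > 1$, so 1 is the best strong convexity constant.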

We will give examples shortly but we first state a proposition that is useful to bound the Sequential 
Rademacher complexity of such linear function classes. 

Proposition 27. Let $W$ be a class of weight vectors such that $0 \le \Psi(w) \le \Psi_{\max}$ for all $w \in W$. Suppose that $\Psi$ is $\sigma$-strongly convex w.r.t. a given norm $\|\cdot\|$. Then, we have,

$$\mathfrak{R}_T(\mathcal{F}_W) \le \|\mathcal{X}\|_* \sqrt{\frac{2 \Psi_{\max} T}{\sigma}},$$

where $\|\mathcal{X}\|_* = \sup_{x \in \mathcal{X}} \|x\|_*$, the maximum dual norm of any vector in the input space.

The proof of Proposition 27 is given in the appendix. It relies on the following lemma which can be found in [18]. There it is stated for i.i.d. mean zero $Z_t$ but the proof given works even for martingale difference sequences.

Lemma 28. Let $\Psi : W \to \mathbb{R}$ be $\sigma$-strongly convex w.r.t. $\|\cdot\|$. Let $Z_t$, $t \ge 1$ be a martingale difference sequence w.r.t. some filtration $\{\mathcal{G}_t\}_{t \ge 1}$ (i.e. $\mathbb{E}[Z_t \mid \mathcal{G}_{t-1}] = 0$) such that $\mathbb{E}\|Z_t\|_*^2 \le V^2$. Define $S_t = \sum_{s \le t} Z_s$. Then, $\Psi^*(S_t) - tV^2/2\sigma$ is a supermartingale. Furthermore, if $\inf_{w \in W} \Psi(w) \ge 0$, then

$$\mathbb{E}[\Psi^*(S_T)] \le \frac{T V^2}{2\sigma}.$$

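We will now show how to use the above result to derive minimax regret guarantees for online convex optimization. This is a particular instance of online learning where $\mathcal{F} = K \subseteq \mathbb{R}^d$ where $K$ is a bounded closed convex set. Suppose $\|u\| \le D$ for all $u \in K$ for some norm $\|\cdot\|$. The adversary's set $\mathcal{X}$ consists of convex $G$-Lipschitz (w.r.t. the dual norm $\|\cdot\|_*$) functions over $K$:

$$\mathcal{X} = \mathcal{X}_{\mathrm{cvx}} = \{ g : K \to \mathbb{R} : g \text{ convex and } G\text{-Lipschitz w.r.t. } \|\cdot\|_* \}.$$

We could directly try to bound the value $\mathcal{V}_T(\mathcal{F}, \mathcal{X}_{\mathrm{cvx}})$ via $\mathfrak{R}_T(\mathcal{F}, \mathcal{X}_{\mathrm{cvx}})$ but this, in fact, cannot give a non-trivial bound. Instead, we use the lemma below to bound the value of the convex game with that of the linear game, i.e. one in which

$$\mathcal{X} = \mathcal{X}_{\mathrm{lin}} = \{ u \mapsto \langle u, x \rangle : \|x\|_* \le G \}.$$

Lemma 29. Suppose $\mathcal{F} = K \subseteq \mathbb{R}^d$ is a closed bounded convex set and let $\mathcal{X}_{\mathrm{cvx}}, \mathcal{X}_{\mathrm{lin}}$ be defined as above. Then, we have

$$\mathcal{V}_T(\mathcal{F}, \mathcal{X}_{\mathrm{cvx}}) = \mathcal{V}_T(\mathcal{F}, \mathcal{X}_{\mathrm{lin}}).$$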

Using the above lemma in conjunction with Proposition 27 above, we can immediately conclude that

$$\mathcal{V}_T(\mathcal{F}, \mathcal{X}_{\mathrm{cvx}}) \le 2\, \mathfrak{R}_T(\mathcal{F}, \mathcal{X}_{\mathrm{lin}}) \le 2G \sqrt{2 \Psi_{\max} T / \sigma}$$

for any non-negative function $\Psi : K \to \mathbb{R}$ that is $\sigma$-strongly convex w.r.t. $\|\cdot\|$. Note that, typically, $\Psi_{\max}$ will depend on $D$. For example, in the particular case when $\|\cdot\| = \|\cdot\|_* = \|\cdot\|_2$, we can take $\Psi(u) = \frac{1}{2}\|u\|_2^2$, so that $\Psi_{\max} = D^2/2$ and $\sigma = 1$, and the above regret bound becomes $2GD\sqrt{T}$, which recovers, up to a constant factor, the guarantee of Zinkevich for his online gradient descent algorithm. In general, for $\|\cdot\| = \|\cdot\|_p$, $\|\cdot\|_* = \|\cdot\|_q$, we can use $\Psi(u) = \frac{1}{2}\|u\|_p^2$ to get a bound of $2GD\sqrt{T/(p-1)}$ since $\Psi$ is $(p-1)$-strongly convex w.r.t. $\|\cdot\|_p$. These $O(\sqrt{T})$ regret rates are not new but we rederive them to illustrate the usefulness of the tools we developed.
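
For comparison with the non-constructive bound above, here is a minimal sketch of projected online gradient descent on linear losses over a Euclidean ball. It is an illustration under stated assumptions, not the paper's construction: the step size $\eta = D/(G\sqrt{T})$, the starting point $0$, and the names `project_l2` and `ogd_regret` are choices made here. With these choices the standard analysis gives regret at most $D^2/(2\eta) + \eta G^2 T/2 = GD\sqrt{T}$, the same order as the bound derived in the text.

```python
import math


def project_l2(w, radius):
    """Euclidean projection of w onto the ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in w))
    if norm <= radius:
        return list(w)
    return [v * radius / norm for v in w]


def ogd_regret(grads, D, G):
    """Regret of projected OGD on the linear losses u -> <u, g_t> over {||u||_2 <= D}.

    grads: list of loss vectors g_t with ||g_t||_2 <= G.
    """
    T, d = len(grads), len(grads[0])
    eta = D / (G * math.sqrt(T))
    w = [0.0] * d
    loss = 0.0
    gsum = [0.0] * d
    for g in grads:
        loss += sum(wi * gi for wi, gi in zip(w, g))
        gsum = [a + b for a, b in zip(gsum, g)]
        w = project_l2([wi - eta * gi for wi, gi in zip(w, g)], D)
    # best fixed point in hindsight is -D * gsum / ||gsum||
    best = -D * math.sqrt(sum(v * v for v in gsum))
    return loss - best
```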



9.2 Example: Margin Based Regret 

In the classical statistical setting, margin bounds provide guarantees on expected zero-one loss of a classifier 
based on the empirical margin zero-one error. These results form the basis of the theory of large margin 
classifiers (see |3HI23j). Recently, in the online setting, margin bounds have been shown through the concept 
of margin via the Littlestone dimension [8]. We show that our machinery can easily lead to margin bounds
for the binary classification games for general function classes $\mathcal{F}$ based on their sequential Rademacher
complexity. We use ideas from [23] to do this.

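Proposition 30. For any function class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ bounded by 1, there exists a randomized player strategy given by $\pi$ such that for any sequence $z_1, \ldots, z_T$ where each $z_t = (x_t, y_t) \in \mathcal{X} \times \{\pm 1\}$, played by the adversary,

$$\mathbb{E}\left[ \sum_{t=1}^{T} \mathbf{1}\{ f_t(x_t) y_t < 0 \} \right] \le \inf_{\gamma > 0} \left\{ \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \mathbf{1}\{ f(x_t) y_t < \gamma \} + \frac{4}{\gamma} \mathfrak{R}_T(\mathcal{F}) + \sqrt{T} \left( 3 + \log\log\left( \frac{1}{\gamma} \right) \right) \right\}.$$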



9.3 Example: Neural Networks

We provide below a bound on sequential Rademacher complexity for classic multi-layer neural networks thus showing they are learnable in the online setting. The model of neural network we consider below and the bounds we provide are analogous to the ones considered in the batch setting in [7]. We now consider a $k$-layer 1-norm neural network. To this end let function class $\mathcal{F}_1$ be given by

$$\mathcal{F}_1 = \left\{ x \mapsto \sum_j w_j^1 x_j \ \Big|\ \|w\|_1 \le B_1 \right\}$$

and further for each $2 \le i \le k$ define

Proposition 31. Say $\sigma : \mathbb{R} \to [-1,1]$ is $L$-Lipschitz, then
where $X_\infty$ is such that $\forall x \in \mathcal{X}$, $\|x\|_\infty \le X_\infty$ and $\mathcal{X} \subseteq \mathbb{R}^d$.

9.4 Example: Decision Trees 

We consider here the supervised learning game where the adversary provides instances from instance space $\mathcal{X}$ and binary labels $\pm 1$ corresponding to the instances, and the player plays decision trees of depth no more than $d$ with decision functions from a set $\mathcal{H} \subseteq \{\pm 1\}^{\mathcal{X}}$ of binary valued functions. The following proposition shows that there exists a player strategy which under certain circumstances could have low regret for the supervised learning (binary) game played with the class of decision trees of depth at most $d$ with decision functions from $\mathcal{H}$. The proposition is analogous to the one in [7] considered in the batch (classical) setting.

Proposition 32. Denote by $\mathcal{T}$ the class of decision trees of depth at most $d$ with decision functions in $\mathcal{H}$. There exists a randomized player strategy $\pi$ such that for any sequence of instances $z_1 = (x_1, y_1), \ldots, z_T = (x_T, y_T) \in (\mathcal{X} \times \{\pm 1\})^T$ played by the adversary,

$$\mathbb{E}\left[ \sum_{t=1}^{T} \mathbf{1}\{ \mathsf{t}_t(x_t) \ne y_t \} \right] \le \inf_{\mathsf{t} \in \mathcal{T}} \sum_{t=1}^{T} \mathbf{1}\{ \mathsf{t}(x_t) \ne y_t \} + O\left( \sum_{l} \min\left( C_T(l),\ d \log^{3/2}(T)\, \mathfrak{R}_T(\mathcal{H}) \right) + \sqrt{T} \left( 3 + 2 \log(N_{\mathrm{leaf}}) \right) \right),$$

where $C_T(l)$ denotes the number of instances which reach the leaf $l$ and are correctly classified in the decision tree $\mathsf{t}$ that minimizes $\sum_{t=1}^{T} \mathbf{1}\{ \mathsf{t}(x_t) \ne y_t \}$, and $N_{\mathrm{leaf}}$ is the number of leaves in this tree.

9.5 Example: Transductive Learning and Prediction of Individual Sequences 

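Let $\mathcal{F}$ be a class of functions from $\mathcal{X}$ to $\mathbb{R}$. Let

$$\mathcal{N}_\infty(\alpha, \mathcal{F}) = \min\left\{ |G| : G \subseteq \mathbb{R}^{\mathcal{X}} \text{ s.t. } \forall f \in \mathcal{F}\ \exists g \in G \text{ satisfying } \|f - g\|_\infty \le \alpha \right\} \qquad (7)$$

be the $\ell_\infty$ covering number at scale $\alpha$, where the cover is pointwise on all of $\mathcal{X}$. It is easy to verify that

$$\forall T, \quad \mathcal{N}_\infty(\alpha, \mathcal{F}, T) \le \mathcal{N}_\infty(\alpha, \mathcal{F}). \qquad (8)$$

Indeed, let $G$ be a minimal cover of $\mathcal{F}$ at scale $\alpha$. We claim that for any tree $\mathbf{x}$ of depth $T$ the set $V = \{ \mathbf{v}^g = g(\mathbf{x}) : g \in G \}$ of $\mathbb{R}$-valued trees is an $\ell_\infty$ cover of $\mathcal{F}$ on $\mathbf{x}$. Fix any $\epsilon \in \{\pm 1\}^T$ and $f \in \mathcal{F}$, and let $g \in G$ be such that $\|f - g\|_\infty \le \alpha$. Then clearly $|\mathbf{v}_t^g(\epsilon) - f(\mathbf{x}_t(\epsilon))| \le \alpha$ for any $1 \le t \le T$, which concludes the proof.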
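
The argument for (8) can be checked mechanically on a small finite example. The sketch below is an illustration under assumptions of my own: the domain has four points, functions are tuples of values, the cover is built greedily (so it is valid but not necessarily minimal), and the names `greedy_cover` and `covers_on_every_path` are hypothetical. It verifies that a pointwise cover of the class also covers the class along every path of a tree.

```python
import random
from itertools import product


def greedy_cover(funcs, alpha):
    """Greedy sup-norm cover: keep f unless it is already within alpha of a kept function."""
    cover = []
    for f in funcs:
        if not any(max(abs(a - b) for a, b in zip(f, g)) <= alpha for g in cover):
            cover.append(f)
    return cover


def covers_on_every_path(funcs, cover, tree, T, alpha):
    """Check that on each path of the tree, every f is alpha-close to some cover element."""
    for eps in product((-1, 1), repeat=T):
        xs = [tree(eps[:t]) for t in range(T)]
        for f in funcs:
            if not any(all(abs(g[x] - f[x]) <= alpha for x in xs) for g in cover):
                return False
    return True


rng = random.Random(0)
X_SIZE, DEPTH, ALPHA = 4, 3, 0.25
funcs = [tuple(rng.uniform(-1, 1) for _ in range(X_SIZE)) for _ in range(30)]
cover = greedy_cover(funcs, ALPHA)


def tree(past):
    """Hypothetical tree: the node depends on the depth and on the number of +1 signs."""
    return (len(past) + sum(1 for e in past if e == 1)) % X_SIZE
```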

This simple observation can be applied in several situations. First, consider the problem of transductive learning, where the set $\mathcal{X} = \{z_1, \ldots, z_n\}$ is a finite set. To ensure online learnability, it is sufficient to consider an assumption on the dependence of $\mathcal{N}_\infty(\alpha, \mathcal{F})$ on $\alpha$. An obvious example of such a class is a VC-type class with $\mathcal{N}_\infty(\alpha, \mathcal{F}) \le (c/\alpha)^d$ for some $c$ which can depend on $n$. Assume that $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$. Substituting this bound on the covering number into

$$\mathfrak{D}_T(\mathcal{F}) = \inf_{\alpha} \left\{ 4T\alpha + 12 \int_\alpha^1 \sqrt{T \log \mathcal{N}_2(\delta, \mathcal{F}, T)}\, d\delta \right\}$$

and choosing $\alpha = 0$, we observe that the value of the supervised game is upper bounded by $2\mathfrak{D}_T(\mathcal{F}) \le 4\sqrt{dT \log c}$ by Proposition 16. It is easy to see that if $n$ is fixed and the problem is learnable in the batch (e.g. PAC) setting, then the problem is learnable in the online transductive model.

In the transductive setting considered by Kakade and Kalai [17], it is assumed that $n \le T$ and that $\mathcal{F}$ is binary-valued. If $\mathcal{F}$ is a class with VC dimension $d$, the Sauer-Shelah lemma ensures that the $\ell_\infty$ cover is smaller than $(en/d)^d \le (eT/d)^d$. Using the previous argument with $c = eT$, we obtain a bound of $4\sqrt{dT \log(eT)}$ for the value of the game, matching [17] up to a constant 2.

We also consider the problem of prediction of individual sequences, which has been studied both in information theory and in learning theory. In particular, in the case of binary prediction, Cesa-Bianchi and Lugosi [10] proved upper bounds on the value of the game in terms of the (classical) Rademacher complexity and the (classical) Dudley integral. The particular assumption made in [10] is that experts are static. That is, their prediction only depends on the current round, not on the past information. Formally, we define static experts as mappings $f : \{1, \ldots, T\} \to \mathcal{Y} = [-1,1]$, and let $\mathcal{F}$ denote a class of such experts. Defining $\mathcal{X} = \{1, \ldots, T\}$ puts us in the setting considered earlier with $n = T$. We immediately obtain $4\sqrt{dT \log(eT)}$, matching the results in [10, p. 1873]. We mention that the upper bound in Theorem 4 in [10] is tighter by a $\log T$ factor if a sharper bound on the $\ell_2$ cover is considered. Finally, for the case of a finite number of experts, clearly $\mathcal{N}_\infty \le N$ which gives the classical $O(\sqrt{T \log N})$ bound on the value of the game [11].



9.6 Example: Isotron 

Recently, Kalai and Sastry [19] introduced a method called Isotron for learning Single Index Models (SIM). These models generalize linear and logistic regression, generalized linear models, and classification by linear threshold functions. For brevity, we only describe the Idealized SIM problem from [19]. In its "batch" version, we assume that the data is revealed at once as a set $\{(x_t, y_t)\}_{t=1}^{m} \in \mathbb{R}^n \times \mathbb{R}$ where $y_t = u(\langle w, x_t \rangle)$ for some unknown $w \in \mathbb{R}^n$ of bounded norm and an unknown non-decreasing $u : \mathbb{R} \to \mathbb{R}$ with a bounded Lipschitz constant. Given this data, the goal is to iteratively find the function $u$ and the direction $w$, making as few mistakes as possible. The error is measured as $\frac{1}{m} \sum_{t=1}^{m} (f_i(x_t) - y_t)^2$, where $f_i(x) = u_i(\langle w_i, x \rangle)$ is the iterative approximation found by the algorithm on the $i$th round. The elegant computationally efficient method presented in [19] is motivated by Perceptron, and a natural open question posed by the authors is whether there is an online variant of Isotron. Before even attempting a quest for such an algorithm, we can ask a more basic question: is the (Idealized) SIM problem even learnable in the online framework? After all, most online methods deal with convex functions, but $u$ is only assumed to be Lipschitz and non-decreasing. We answer the question easily with the tools we have developed.

We are interested in online learnability in the supervised setting of the following class of functions

$$\mathcal{H} = \left\{ f(x, y) = (y - u(\langle w, x \rangle))^2 \ \big|\ u : [-1,1] \to [-1,1] \text{ 1-Lipschitz},\ \|w\|_2 \le 1 \right\} \qquad (9)$$

over $\mathcal{X} = B_2$ (the unit Euclidean ball in $\mathbb{R}^d$) and $\mathcal{Y} = [-1,1]$, where both $u$ and $w$ range over the possibilities. In particular, we prove the result for Lipschitz, but not necessarily non-decreasing functions. It is evident that $\mathcal{H}$ is a composition with three levels: the squared loss, the Lipschitz non-decreasing function, and the linear function. The proof of the following Proposition boils down to showing that the covering number of the class does not increase much under these compositions.

Proposition 33. The class $\mathcal{H}$ defined in (9) is online learnable in the supervised setting. Moreover,

$$\mathcal{V}_T(\mathcal{H}, \mathcal{X} \times \mathcal{Y}) = O(\sqrt{T} \log^{3/2} T).$$



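Proof. First, by the classical result of Kolmogorov and Tihomirov [3T], the class $\mathcal{G}$ of all bounded Lipschitz functions has small metric entropy: $\log \mathcal{N}_\infty(\alpha, \mathcal{G}) = \Theta(1/\alpha)$. For the particular class of non-decreasing 1-Lipschitz functions, it is trivial to verify that the entropy is in fact bounded by $2/\alpha$.

Next, consider the class $\mathcal{F} = \{ \langle w, x \rangle \mid \|w\|_2 \le 1 \}$ over the Euclidean ball. By Proposition 27, $\mathfrak{R}_T(\mathcal{F}) \le \sqrt{2T}$. Using the lower bound of Proposition 16, $\mathrm{fat}_\alpha \le 64/\alpha^2$ whenever $\alpha > 8/\sqrt{T}$. This implies that $\mathcal{N}_\infty(\alpha, \mathcal{F}, T) \le (2eT/\alpha)^{64/\alpha^2}$ whenever $\alpha > 8/\sqrt{T}$. Note that this bound does not depend on the ambient dimension of $\mathcal{X}$.

Next, we show that a composition of $\mathcal{G}$ with any small class $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$ also has a small cover. To this end, suppose $\mathcal{N}_\infty(\alpha, \mathcal{F}, T)$ is the covering number for $\mathcal{F}$. Fix a particular tree $\mathbf{x}$ and let $V = \{\mathbf{v}_1, \ldots, \mathbf{v}_N\}$ be an $\ell_\infty$ cover of $\mathcal{F}$ on $\mathbf{x}$ at scale $\alpha$. Analogously, let $W = \{g_1, \ldots, g_M\}$ be an $\ell_\infty$ cover of $\mathcal{G}$ with $M = \mathcal{N}_\infty(\alpha, \mathcal{G})$. Consider the class $\mathcal{G} \circ \mathcal{F} = \{ g \circ f : g \in \mathcal{G}, f \in \mathcal{F} \}$. The claim is that $\{ g(\mathbf{v}) : \mathbf{v} \in V, g \in W \}$ provides an $\ell_\infty$ cover for $\mathcal{G} \circ \mathcal{F}$ on $\mathbf{x}$. Fix any $f \in \mathcal{F}$, $g \in \mathcal{G}$ and $\epsilon \in \{\pm 1\}^T$. Let $\mathbf{v} \in V$ be such that $\max_{t \in [T]} |f(\mathbf{x}_t(\epsilon)) - \mathbf{v}_t(\epsilon)| \le \alpha$, and let $g' \in W$ be such that $\|g - g'\|_\infty \le \alpha$. Then, using the fact that functions in $\mathcal{G}$ are 1-Lipschitz, for any $t \in [T]$,

$$|g(f(\mathbf{x}_t(\epsilon))) - g'(\mathbf{v}_t(\epsilon))| \le |g(f(\mathbf{x}_t(\epsilon))) - g'(f(\mathbf{x}_t(\epsilon)))| + |g'(f(\mathbf{x}_t(\epsilon))) - g'(\mathbf{v}_t(\epsilon))| \le 2\alpha.$$

Hence, $\mathcal{N}_\infty(2\alpha, \mathcal{G} \circ \mathcal{F}, T) \le \mathcal{N}_\infty(\alpha, \mathcal{G}) \times \mathcal{N}_\infty(\alpha, \mathcal{F}, T)$.

Finally, we put all the pieces together. By Lemma 23, the Sequential Rademacher complexity of $\mathcal{H}$ is bounded by 4 times the Sequential Rademacher complexity of the class

$$\mathcal{G} \circ \mathcal{F} = \{ u(\langle w, x \rangle) \mid u : [-1,1] \to [-1,1] \text{ is 1-Lipschitz},\ \|w\|_2 \le 1 \}$$

since the squared loss is 4-Lipschitz on the space of possible values. The latter complexity is then bounded by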

©(^oJT) < 32\/t + 12 / log AA((5, GoF,T)d5< 32Vt + uVT [ J\ + ^ log(2eT)d5 . 

Jb/Vt Js/Vt V d 

We conclude that the value of the game $\mathcal{V}_T(\mathcal{H}, \mathcal{X} \times \mathcal{Y}) = O(\sqrt{T} \log^{3/2} T)$. □
A Proofs



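Proof of Theorem 1. For simplicity, denote $\psi(x_{1:T}) = \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t)$. The first step in the proof is to appeal to the minimax theorem for every couple of inf and sup:

$$\begin{aligned}
&\inf_{q_1} \sup_{x_1} \mathbb{E}_{f_1 \sim q_1} \cdots \inf_{q_T} \sup_{x_T} \mathbb{E}_{f_T \sim q_T} \left[ \sum_{t=1}^{T} f_t(x_t) - \psi(x_{1:T}) \right] \\
&= \inf_{q_1} \sup_{p_1} \mathbb{E}_{f_1 \sim q_1,\, x_1 \sim p_1} \cdots \inf_{q_T} \sup_{p_T} \mathbb{E}_{f_T \sim q_T,\, x_T \sim p_T} \left[ \sum_{t=1}^{T} f_t(x_t) - \psi(x_{1:T}) \right] \\
&= \sup_{p_1} \inf_{q_1} \mathbb{E}_{f_1 \sim q_1,\, x_1 \sim p_1} \cdots \sup_{p_T} \inf_{q_T} \mathbb{E}_{f_T \sim q_T,\, x_T \sim p_T} \left[ \sum_{t=1}^{T} f_t(x_t) - \psi(x_{1:T}) \right] \qquad \text{(by Minimax theorem)} \\
&= \sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1 \sim p_1} \cdots \sup_{p_T} \inf_{f_T} \mathbb{E}_{x_T \sim p_T} \left[ \sum_{t=1}^{T} f_t(x_t) - \psi(x_{1:T}) \right].
\end{aligned}$$

From now on, it will be understood that $x_t$ has distribution $p_t$. By moving the expectation with respect to $x_T$ and then the infimum with respect to $f_T$ inside the expression, we arrive at

$$\begin{aligned}
&\sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} \sup_{p_T} \left[ \sum_{t=1}^{T-1} f_t(x_t) + \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \mathbb{E}_{x_T} \psi(x_{1:T}) \right] \\
&= \sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} \sup_{p_T} \mathbb{E}_{x_T} \left[ \sum_{t=1}^{T-1} f_t(x_t) + \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \psi(x_{1:T}) \right].
\end{aligned}$$

Let us now repeat the procedure for step $T - 1$. The above expression is equal to

$$\begin{aligned}
&\sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} \left[ \sum_{t=1}^{T-1} f_t(x_t) + \sup_{p_T} \mathbb{E}_{x_T} \left[ \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \psi(x_{1:T}) \right] \right] \\
&= \sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \left[ \sum_{t=1}^{T-2} f_t(x_t) + \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} f_{T-1}(x_{T-1}) + \mathbb{E}_{x_{T-1}} \sup_{p_T} \mathbb{E}_{x_T} \left[ \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \psi(x_{1:T}) \right] \right] \\
&= \sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \mathbb{E}_{x_{T-1}} \sup_{p_T} \mathbb{E}_{x_T} \left[ \sum_{t=1}^{T-2} f_t(x_t) + \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} f_{T-1}(x_{T-1}) + \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \psi(x_{1:T}) \right].
\end{aligned}$$

Continuing in this fashion for $T - 2$ and all the way down to $t = 1$ proves the theorem. □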



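Proof of the Key Technical Lemma. We start by noting that since $x_T, x'_T$ are both drawn from $p_T$,

$$\begin{aligned}
\mathbb{E}_{x_T, x'_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \Delta_f(x_t, x'_t) \right]
&= \mathbb{E}_{x_T, x'_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T-1} \Delta_f(x_t, x'_t) + \Delta_f(x_T, x'_T) \right\} \right] \\
&= \mathbb{E}_{x'_T, x_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T-1} \Delta_f(x_t, x'_t) + \Delta_f(x'_T, x_T) \right\} \right] \\
&= \mathbb{E}_{x_T, x'_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T-1} \Delta_f(x_t, x'_t) - \Delta_f(x_T, x'_T) \right\} \right],
\end{aligned}$$

where the last line is by antisymmetry of $\Delta_f$. Since the first and last lines are equal, they are both equal to their average and hence

$$\mathbb{E}_{x_T, x'_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \Delta_f(x_t, x'_t) \right] = \mathbb{E}_{x_T, x'_T \sim p_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T-1} \Delta_f(x_t, x'_t) + \epsilon_T \Delta_f(x_T, x'_T) \right\} \right].$$

Hence we conclude that

$$\begin{aligned}
\sup_{p_T} \mathbb{E}_{x_T, x'_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \Delta_f(x_t, x'_t) \right]
&= \sup_{p_T} \mathbb{E}_{x_T, x'_T \sim p_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T-1} \Delta_f(x_t, x'_t) + \epsilon_T \Delta_f(x_T, x'_T) \right\} \right] \\
&\le \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T-1} \Delta_f(x_t, x'_t) + \epsilon_T \Delta_f(x_T, x'_T) \right\} \right].
\end{aligned}$$

Using the above and noting that $x_{T-1}, x'_{T-1}$ are both drawn from $p_{T-1}$, and hence, similarly to the previous step, introducing the Rademacher variable $\epsilon_{T-1}$, we get that

$$\begin{aligned}
&\sup_{p_{T-1}} \mathbb{E}_{x_{T-1}, x'_{T-1} \sim p_{T-1}} \sup_{p_T} \mathbb{E}_{x_T, x'_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \Delta_f(x_t, x'_t) \right] \\
&\le \sup_{p_{T-1}} \mathbb{E}_{x_{T-1}, x'_{T-1} \sim p_{T-1}} \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T-1} \Delta_f(x_t, x'_t) + \epsilon_T \Delta_f(x_T, x'_T) \right\} \right] \\
&= \sup_{p_{T-1}} \mathbb{E}_{x_{T-1}, x'_{T-1} \sim p_{T-1}} \mathbb{E}_{\epsilon_{T-1}} \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T-2} \Delta_f(x_t, x'_t) + \epsilon_{T-1} \Delta_f(x_{T-1}, x'_{T-1}) + \epsilon_T \Delta_f(x_T, x'_T) \right\} \right] \\
&\le \sup_{x_{T-1}, x'_{T-1}} \mathbb{E}_{\epsilon_{T-1}} \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T-2} \Delta_f(x_t, x'_t) + \epsilon_{T-1} \Delta_f(x_{T-1}, x'_{T-1}) + \epsilon_T \Delta_f(x_T, x'_T) \right\} \right].
\end{aligned}$$

Proceeding in similar fashion, introducing Rademacher variables all the way up to $\epsilon_1$, we finally get the required statement that

$$\sup_{p_1} \mathbb{E}_{x_1, x'_1 \sim p_1} \cdots \sup_{p_T} \mathbb{E}_{x_T, x'_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \Delta_f(x_t, x'_t) \right] \le \sup_{x_1, x'_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t \Delta_f(x_t, x'_t) \right].$$

An identical proof with the absolute value gives

$$\sup_{p_1} \mathbb{E}_{x_1, x'_1 \sim p_1} \cdots \sup_{p_T} \mathbb{E}_{x_T, x'_T \sim p_T} \left[ \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^{T} \Delta_f(x_t, x'_t) \right| \right] \le \sup_{x_1, x'_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^{T} \epsilon_t \Delta_f(x_t, x'_t) \right| \right].$$

□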



t=i 



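Proof of Lemma\B^ For any $\lambda > 0$, we invoke Jensen's inequality to get

$$\begin{aligned}
M(\lambda) := \exp\left\{ \lambda\, \mathbb{E}_\epsilon \left[ \max_{\mathbf{v} \in V} \sum_{t=1}^{T} \epsilon_t \mathbf{v}_t(\epsilon) \right] \right\}
&\le \mathbb{E}_\epsilon \left[ \exp\left\{ \lambda \max_{\mathbf{v} \in V} \sum_{t=1}^{T} \epsilon_t \mathbf{v}_t(\epsilon) \right\} \right] \\
&\le \mathbb{E}_\epsilon \left[ \max_{\mathbf{v} \in V} \exp\left\{ \lambda \sum_{t=1}^{T} \epsilon_t \mathbf{v}_t(\epsilon) \right\} \right]
\le \mathbb{E}_\epsilon \left[ \sum_{\mathbf{v} \in V} \exp\left\{ \lambda \sum_{t=1}^{T} \epsilon_t \mathbf{v}_t(\epsilon) \right\} \right].
\end{aligned}$$

With the usual technique of peeling from the end,

$$\begin{aligned}
M(\lambda) &\le \sum_{\mathbf{v} \in V} \mathbb{E}_{\epsilon_1, \ldots, \epsilon_T} \left[ \prod_{t=1}^{T} \exp\{ \lambda \epsilon_t \mathbf{v}_t(\epsilon_{1:t-1}) \} \right] \\
&= \sum_{\mathbf{v} \in V} \mathbb{E}_{\epsilon_1, \ldots, \epsilon_{T-1}} \left[ \prod_{t=1}^{T-1} \exp\{ \lambda \epsilon_t \mathbf{v}_t(\epsilon_{1:t-1}) \} \times \frac{\exp\{ \lambda \mathbf{v}_T(\epsilon_{1:T-1}) \} + \exp\{ -\lambda \mathbf{v}_T(\epsilon_{1:T-1}) \}}{2} \right] \\
&\le \sum_{\mathbf{v} \in V} \mathbb{E}_{\epsilon_1, \ldots, \epsilon_{T-1}} \left[ \prod_{t=1}^{T-1} \exp\{ \lambda \epsilon_t \mathbf{v}_t(\epsilon_{1:t-1}) \} \times \exp\left\{ \frac{\lambda^2 \mathbf{v}_T(\epsilon_{1:T-1})^2}{2} \right\} \right],
\end{aligned}$$

where we used the inequality $\frac{1}{2}\{\exp(a) + \exp(-a)\} \le \exp(a^2/2)$, valid for all $a \in \mathbb{R}$. Peeling off the second term is a bit more involved:

$$\begin{aligned}
M(\lambda) \le \sum_{\mathbf{v} \in V} \mathbb{E}_{\epsilon_1, \ldots, \epsilon_{T-2}} \Bigg[ \prod_{t=1}^{T-2} \exp\{ \lambda \epsilon_t \mathbf{v}_t(\epsilon_{1:t-1}) \} \times \frac{1}{2} \bigg( &\exp\{ \lambda \mathbf{v}_{T-1}(\epsilon_{1:T-2}) \} \exp\left\{ \frac{\lambda^2 \mathbf{v}_T((\epsilon_{1:T-2}, 1))^2}{2} \right\} \\
+ &\exp\{ -\lambda \mathbf{v}_{T-1}(\epsilon_{1:T-2}) \} \exp\left\{ \frac{\lambda^2 \mathbf{v}_T((\epsilon_{1:T-2}, -1))^2}{2} \right\} \bigg) \Bigg].
\end{aligned}$$

Consider the term inside:

$$\begin{aligned}
&\frac{1}{2} \left( \exp\{ \lambda \mathbf{v}_{T-1}(\epsilon_{1:T-2}) \} \exp\left\{ \frac{\lambda^2 \mathbf{v}_T((\epsilon_{1:T-2}, 1))^2}{2} \right\} + \exp\{ -\lambda \mathbf{v}_{T-1}(\epsilon_{1:T-2}) \} \exp\left\{ \frac{\lambda^2 \mathbf{v}_T((\epsilon_{1:T-2}, -1))^2}{2} \right\} \right) \\
&\le \max_{\epsilon_{T-1}} \left( \exp\left\{ \frac{\lambda^2 \mathbf{v}_T((\epsilon_{1:T-2}, \epsilon_{T-1}))^2}{2} \right\} \right) \frac{\exp\{ \lambda \mathbf{v}_{T-1}(\epsilon_{1:T-2}) \} + \exp\{ -\lambda \mathbf{v}_{T-1}(\epsilon_{1:T-2}) \}}{2} \\
&\le \max_{\epsilon_{T-1}} \left( \exp\left\{ \frac{\lambda^2 \mathbf{v}_T((\epsilon_{1:T-2}, \epsilon_{T-1}))^2}{2} \right\} \right) \exp\left\{ \frac{\lambda^2 \mathbf{v}_{T-1}(\epsilon_{1:T-2})^2}{2} \right\} \\
&= \exp\left\{ \frac{\lambda^2 \max_{\epsilon_{T-1} \in \{\pm 1\}} \left( \mathbf{v}_{T-1}(\epsilon_{1:T-2})^2 + \mathbf{v}_T(\epsilon_{1:T-1})^2 \right)}{2} \right\}.
\end{aligned}$$

Repeating the last steps, we show that for any $i$,

$$M(\lambda) \le \sum_{\mathbf{v} \in V} \mathbb{E}_{\epsilon_1, \ldots, \epsilon_{i-1}} \left[ \prod_{t=1}^{i-1} \exp\{ \lambda \epsilon_t \mathbf{v}_t(\epsilon_{1:t-1}) \} \times \exp\left\{ \frac{\lambda^2 \max_{\epsilon_i, \ldots, \epsilon_{T-1} \in \{\pm 1\}} \sum_{t=i}^{T} \mathbf{v}_t(\epsilon_{1:t-1})^2}{2} \right\} \right].$$

We arrive at

$$M(\lambda) \le \sum_{\mathbf{v} \in V} \exp\left\{ \frac{\lambda^2 \max_{\epsilon_1, \ldots, \epsilon_{T-1} \in \{\pm 1\}} \sum_{t=1}^{T} \mathbf{v}_t(\epsilon_{1:t-1})^2}{2} \right\} \le |V| \exp\left\{ \frac{\lambda^2 \max_{\mathbf{v} \in V} \max_{\epsilon \in \{\pm 1\}^T} \sum_{t=1}^{T} \mathbf{v}_t(\epsilon)^2}{2} \right\}.$$

Taking logarithms on both sides, dividing by $\lambda$ and setting $\lambda = \sqrt{ \frac{2 \log(|V|)}{\max_{\mathbf{v} \in V} \max_{\epsilon \in \{\pm 1\}^T} \sum_{t=1}^{T} \mathbf{v}_t(\epsilon)^2} }$ we conclude that

$$\mathbb{E}_\epsilon \left[ \max_{\mathbf{v} \in V} \sum_{t=1}^{T} \epsilon_t \mathbf{v}_t(\epsilon) \right] \le \sqrt{ 2 \log(|V|) \max_{\mathbf{v} \in V} \max_{\epsilon \in \{\pm 1\}^T} \sum_{t=1}^{T} \mathbf{v}_t(\epsilon)^2 }.$$

□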
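
The inequality just proved, $\mathbb{E}_\epsilon[\max_{\mathbf{v} \in V} \sum_t \epsilon_t \mathbf{v}_t(\epsilon)] \le \sqrt{2 \log|V| \max_{\mathbf{v}} \max_\epsilon \sum_t \mathbf{v}_t(\epsilon)^2}$, can be verified by brute force on small random trees. The sketch below is only a numerical check under assumptions: trees are stored as dictionaries keyed by the tuple of past signs, node values are drawn uniformly from $[-1,1]$, and the function name `finite_class_check` is hypothetical.

```python
import math
import random
from itertools import product


def finite_class_check(num_trees, T, seed):
    """Return (lhs, rhs) of the bound for a random finite set of real-valued trees."""
    rng = random.Random(seed)
    # a node at depth t is indexed by the t-1 signs leading to it
    prefixes = [p for t in range(T) for p in product((-1, 1), repeat=t)]
    trees = [{p: rng.uniform(-1, 1) for p in prefixes} for _ in range(num_trees)]
    paths = list(product((-1, 1), repeat=T))
    # exact expectation over all 2^T sign sequences
    lhs = sum(
        max(sum(eps[t] * v[eps[:t]] for t in range(T)) for v in trees)
        for eps in paths
    ) / len(paths)
    worst = max(sum(v[eps[:t]] ** 2 for t in range(T)) for v in trees for eps in paths)
    rhs = math.sqrt(2 * math.log(num_trees) * worst)
    return lhs, rhs
```

The check enumerates all $2^T$ sign sequences, so it is practical only for small depths.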



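Proof of Lemma^ We prove the first inequality. Let $\{\mathbf{w}^1, \ldots, \mathbf{w}^M\}$ be a largest strongly $2\alpha$-separated set of $\mathcal{F}(\mathbf{x})$ with $M = \mathcal{M}_p(2\alpha, \mathcal{F}, \mathbf{x})$. Let $\{\mathbf{v}^1, \ldots, \mathbf{v}^N\}$ be a smallest $\alpha$-cover of $\mathcal{F}$ on $\mathbf{x}$ with $N = \mathcal{N}_p(\alpha, \mathcal{F}, \mathbf{x})$. For the sake of contradiction, assume $M > N$. Consider a path $\epsilon \in \{\pm 1\}^T$ on which all the trees $\{\mathbf{w}^1, \ldots, \mathbf{w}^M\}$ are $(2\alpha)$-separated. By the definition of a cover, for any $\mathbf{w}^i$ there exists a tree $\mathbf{v}^j$ such that

$$\left( \frac{1}{T} \sum_{t=1}^{T} |\mathbf{v}^j_t(\epsilon) - \mathbf{w}^i_t(\epsilon)|^p \right)^{1/p} \le \alpha.$$

Since $M > N$, there must exist distinct $\mathbf{w}^i$ and $\mathbf{w}^k$, for which the covering tree $\mathbf{v}^j$ is the same for the given path $\epsilon$. By the triangle inequality,

$$\left( \frac{1}{T} \sum_{t=1}^{T} |\mathbf{w}^i_t(\epsilon) - \mathbf{w}^k_t(\epsilon)|^p \right)^{1/p} \le 2\alpha,$$

which is a contradiction. We conclude that $M \le N$.

Now, we prove the second inequality. Consider a maximal $\alpha$-packing $V \subseteq \mathcal{F}(\mathbf{x})$ of size $\mathcal{D}_p(\alpha, \mathcal{F}, \mathbf{x})$. Since this is a maximal $\alpha$-packing, for any $f \in \mathcal{F}$, there is no path on which $f(\mathbf{x})$ is $\alpha$-separated from every member of the packing. In other words, for every path $\epsilon \in \{\pm 1\}^T$, there is a member of the packing $\mathbf{v} \in V$ such that

$$\left( \frac{1}{T} \sum_{t=1}^{T} |\mathbf{v}_t(\epsilon) - f(\mathbf{x}_t(\epsilon))|^p \right)^{1/p} \le \alpha,$$

which means that the packing is a cover. □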




Proof of Theorem 5. For any $d \ge 0$ and $T \ge 0$, define the function

$$g_k(d, T) = \sum_{i=0}^{d} \binom{T}{i} k^i.$$

It is not difficult to verify that this function satisfies the recurrence

$$g_k(d, T) = g_k(d, T - 1) + k\, g_k(d - 1, T - 1)$$

for all $d, T \ge 1$. To visualize this recursion, consider a $k \times T$ matrix and ask for ways to choose at most $d$ columns followed by a choice among the $k$ rows for each chosen column. The task can be decomposed into (a) making the $d$ column choices out of the first $T - 1$ columns, followed by picking rows (there are $g_k(d, T - 1)$ ways to do it) or (b) choosing $d - 1$ columns (followed by row choices) out of the first $T - 1$ columns and choosing a row for the $T$th column (there are $k\, g_k(d - 1, T - 1)$ ways to do it). This gives the recursive formula.
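
The recurrence is easy to confirm numerically. The sketch below assumes the closed form $g_k(d, T) = \sum_{i=0}^{d} \binom{T}{i} k^i$, which is the form consistent with both the recurrence and the base case $g_k(1,1) = k + 1$ used in the induction; the function name `g` is a hypothetical shorthand.

```python
from math import comb


def g(k, d, T):
    """g_k(d, T): ways to choose at most d of T columns and one of k rows per chosen column."""
    return sum(comb(T, i) * k ** i for i in range(d + 1))


def recurrence_holds(k, d, T):
    """Check g_k(d, T) = g_k(d, T-1) + k * g_k(d-1, T-1) for d, T >= 1."""
    return g(k, d, T) == g(k, d, T - 1) + k * g(k, d - 1, T - 1)
```

The identity is Pascal's rule $\binom{T}{i} = \binom{T-1}{i} + \binom{T-1}{i-1}$ applied term by term.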

In what follows, we shall refer to an $\ell_\infty$ cover at scale 1/2 simply as a 1/2-cover. The theorem claims that the size of a minimal 1/2-cover is at most $g_k(d, T)$. The proof proceeds by induction on $T + d$.

Base: For $d = 1$ and $T = 1$, there is only one node in the tree, i.e. the tree is defined by the constant $\mathbf{x}_1 \in \mathcal{X}$. Functions in $\mathcal{F}$ can take up to $k + 1$ values on $\mathbf{x}_1$, i.e. $\mathcal{N}(0, \mathcal{F}, 1) \le k + 1$ (and, thus, also for the 1/2-cover). Using the convention $\binom{0}{0} = 1$, we indeed verify that $g_k(1, 1) = 1 + k = k + 1$. The same calculation gives the base case for $T = 1$ and any $d \in \mathbb{N}$. Furthermore, for any $T \in \mathbb{N}$ if $d = 0$, then there is no point which is 2-shattered by $\mathcal{F}$. This means that functions in $\mathcal{F}$ differ by at most 1 on any point of $\mathcal{X}$. Thus, there is a 1/2-cover of size $1 = g_k(0, T)$, verifying this base case.
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Induction step: Suppose by the way of induction that the statement holds for (rf, T — 1) and (d— 1, T— 1). 
Consider any tree x of depth T with fat2(J-^, x) = d. Define the partition = Jxi U . . . U J-fe with Ti = {f & 
T : /(xi) = i} for 2 G {0, . . . , A:}, where xi is the root of x. Let n=\{i: fat2(J-i,x) = d\\. 

Suppose first, for the sake of contradiction, that fat2(J^i,x) = fat2(J\,-,x) = d for — > 2. Then there exist 
two trees z and v of depth d which arc 2-shattered by Ti and Tj, respectively, and with Img(z), Iing(v) C 
Img(x). Since functions within each subset J^i take on the same values on Xi, we conclude that Xi ^ 
Img(z),xi ^ Img(v). This follows immediately from the definition of shattering. We now join the two 
shattered z and v trees with xi at the root and observe that Ti U Tj 2-shatters this resulting tree of depth 
d+1, which is a contradiction. Indeed, the witness M- valued tree s is constructed by joining the two witnesses 
for the 2-shattered trees z and v and by defining the root as si = (i + j)/2. It is easy to see that s is a 
witness to the shattering. Given any e € {±1}''+^, there is a function € J^i which realizes the desired 
separation under the signs (e2, . . . , Cd+i) for the tree z and there is a function G J^j which does the same 
for V. Depending on ei = +1 or ei = —1, either /' or realize the separation over e. 

We conclude that the number of subsets of $\mathcal{F}$ with fat-shattering dimension equal to $d$ cannot be more than two (for otherwise at least two indices will be separated by 2 or more). We have three cases: $n = 0$, $n = 1$, or $n = 2$, and in the last case it must be that the indices of the two subsets differ by 1.

First, consider any $\mathcal{F}_i$ with $\mathrm{fat}_2(\mathcal{F}_i, \mathbf{x}) \le d-1$, $i \in \{0, \ldots, k\}$. By induction, there are $1/2$-covers $V^\ell$ and $V^r$ of $\mathcal{F}_i$ on the subtrees $\mathbf{x}^\ell$ and $\mathbf{x}^r$, respectively, both of size at most $g_k(d-1,T-1)$. Informally, out of these $1/2$-covers we can create a $1/2$-cover $V$ for $\mathcal{F}_i$ on $\mathbf{x}$ by pairing the $1/2$-covers in $V^\ell$ and $V^r$. The resulting cover of $\mathcal{F}_i$ will be of size $g_k(d-1,T-1)$. Formally, consider a set of pairs $(\mathbf{v}^\ell, \mathbf{v}^r)$ of trees, with $\mathbf{v}^\ell \in V^\ell$, $\mathbf{v}^r \in V^r$ and such that each tree in $V^\ell$ and $V^r$ appears in at least one of the pairs. Clearly, this can be done using at most $g_k(d-1,T-1)$ pairs, and such a pairing is not unique. We join the subtrees in every pair $(\mathbf{v}^\ell, \mathbf{v}^r)$ with a constant $i$ as the root, thus creating a set $V$ of trees, $|V| \le g_k(d-1,T-1)$. We claim that $V$ is a $1/2$-cover for $\mathcal{F}_i$ on $\mathbf{x}$. Note that all the functions in $\mathcal{F}_i$ take on the same value $i$ on $x_1$ and by construction $\mathbf{v}_1 = i$ for any $\mathbf{v} \in V$. Now, consider any $f \in \mathcal{F}_i$ and $\epsilon \in \{\pm 1\}^T$. Without loss of generality, assume $\epsilon_1 = -1$. By assumption, there is a $\mathbf{v}^\ell \in V^\ell$ such that $|\mathbf{v}^\ell_t(\epsilon_{2:T}) - f(\mathbf{x}_{t+1}(\epsilon_{1:T}))| \le 1/2$ for any $t \in [T-1]$. By construction $\mathbf{v}^\ell$ appears as a left subtree of at least one tree in $V$, which, therefore, matches the values of $f$ for $\epsilon_{1:T}$. The same argument holds for $\epsilon_1 = +1$ by finding an appropriate subtree in $V^r$. We conclude that $V$ is a $1/2$-cover of $\mathcal{F}_i$ on $\mathbf{x}$, and this holds for any $i \in \{0, \ldots, k\}$ with $\mathrm{fat}_2(\mathcal{F}_i, \mathbf{x}) \le d-1$. Therefore, the total size of a $1/2$-cover for the union $\bigcup_{i : \mathrm{fat}_2(\mathcal{F}_i, \mathbf{x}) \le d-1} \mathcal{F}_i$ is at most $(k+1-n)\, g_k(d-1,T-1)$. If $n = 0$, the induction step is proven because $g_k(d-1,T-1) \le g_k(d,T-1)$ and so the total size of the constructed cover is at most

$$(k+1)\, g_k(d-1,T-1) \le g_k(d,T-1) + k\, g_k(d-1,T-1) = g_k(d,T).$$

Now, consider the case $n = 1$ and let $\mathrm{fat}_2(\mathcal{F}_i, \mathbf{x}) = d$. An argument exactly as above yields a $1/2$-cover for $\mathcal{F}_i$, and this cover is of size at most $g_k(d,T-1)$ by induction. The total $1/2$-cover is therefore of size at most

$$g_k(d,T-1) + k\, g_k(d-1,T-1) = g_k(d,T).$$

Lastly, for $n = 2$, suppose $\mathrm{fat}_2(\mathcal{F}_i, \mathbf{x}) = \mathrm{fat}_2(\mathcal{F}_j, \mathbf{x}) = d$ for $|i - j| = 1$. Let $\mathcal{F}' = \mathcal{F}_i \cup \mathcal{F}_j$. Note that $\mathrm{fat}_2(\mathcal{F}', \mathbf{x}) = d$. Just as before, the $1/2$-covering for $\mathcal{F}'$ on $\mathbf{x}$ can be constructed by considering the $1/2$-covers for the two subtrees. However, when joining any $(\mathbf{v}^\ell, \mathbf{v}^r)$, we take $(i+j)/2$ as the root. It is straightforward to check that the resulting cover is indeed a $1/2$-cover, thanks to the relation $|i - j| = 1$. The size of the constructed cover is, by induction, $g_k(d,T-1)$, and the induction step follows. This concludes the induction proof, yielding the main statement of the theorem.

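Finally, the upper bound on $g_k(d,T)$ is

$$\sum_{i=0}^{d} \binom{T}{i} k^i \le \left(\frac{kT}{d}\right)^d \sum_{i=0}^{d} \binom{T}{i} \left(\frac{d}{T}\right)^i \le \left(\frac{kT}{d}\right)^d \left(1 + \frac{d}{T}\right)^T \le \left(\frac{ekT}{d}\right)^d$$

whenever $T \ge d$.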

□ 

Proof of Theorem. The proof is very close to the proof of Theorem 5, with a few key differences. As before, for any $d \ge 0$ and $T \ge 0$, define the function $g_k(d,T) = \sum_{i=0}^{d} \binom{T}{i} k^i$.

The theorem claims that the size of a minimal 0-cover is at most $g_k(d,T)$. The proof proceeds by induction on $T + d$.

Base: For $d = 1$ and $T = 1$, there is only one node in the tree, i.e. the tree is defined by the constant $x_1 \in \mathcal{X}$. Functions in $\mathcal{F}$ can take up to $k+1$ values on $x_1$, i.e. $\mathcal{N}(0, \mathcal{F}, 1) \le k+1$. Using the convention $\binom{0}{0} = 1$, we indeed verify that $g_k(1,1) = 1 + k = k + 1$. The same calculation gives the base case for $T = 1$ and any $d \in \mathbb{N}$. Furthermore, for any $T \in \mathbb{N}$, if $d = 0$ then there is no point which is 1-shattered by $\mathcal{F}$. This means that all functions in $\mathcal{F}$ are identical, proving that there is a 0-cover of size $1 = g_k(0,T)$.

Induction step: Suppose by the way of induction that the statement holds for $(d, T-1)$ and $(d-1, T-1)$. Consider any tree $\mathbf{x}$ of depth $T$ with $\mathrm{fat}_1(\mathcal{F}, \mathbf{x}) = d$. Define the partition $\mathcal{F} = \mathcal{F}_0 \cup \ldots \cup \mathcal{F}_k$ with $\mathcal{F}_i = \{f \in \mathcal{F} : f(x_1) = i\}$ for $i \in \{0, \ldots, k\}$, where $x_1$ is the root of $\mathbf{x}$.

We first argue that $\mathrm{fat}_1(\mathcal{F}_i, \mathbf{x}) = d$ for at most one value $i \in \{0, \ldots, k\}$. By the way of contradiction, suppose we do have $\mathrm{fat}_1(\mathcal{F}_i, \mathbf{x}) = \mathrm{fat}_1(\mathcal{F}_j, \mathbf{x}) = d$ for $i \ne j$. Then there exist two trees $\mathbf{z}$ and $\mathbf{v}$ of depth $d$ 1-shattered by $\mathcal{F}_i$ and $\mathcal{F}_j$, respectively, and with $\mathrm{Img}(\mathbf{z}), \mathrm{Img}(\mathbf{v}) \subseteq \mathrm{Img}(\mathbf{x})$. Since functions within each subset $\mathcal{F}_i$ take on the same values on $x_1$, we conclude that $x_1 \notin \mathrm{Img}(\mathbf{z})$, $x_1 \notin \mathrm{Img}(\mathbf{v})$. This follows immediately from the definition of shattering. We now join the two shattered trees $\mathbf{z}$ and $\mathbf{v}$ with $x_1$ at the root and observe that $\mathcal{F}_i \cup \mathcal{F}_j$ 1-shatters this resulting tree of depth $d+1$, which is a contradiction. Indeed, the witness $\mathbb{R}$-valued tree $\mathbf{s}$ is constructed by joining the two witnesses for the 1-shattered trees $\mathbf{z}$ and $\mathbf{v}$ and by defining the root as $s_1 = (i+j)/2$. It is easy to see that $\mathbf{s}$ is a witness to the shattering. Given any $\epsilon \in \{\pm 1\}^{d+1}$, there is a function $f^i \in \mathcal{F}_i$ which realizes the desired separation under the signs $(\epsilon_2, \ldots, \epsilon_{d+1})$ for the tree $\mathbf{z}$ and there is a function $f^j \in \mathcal{F}_j$ which does the same for $\mathbf{v}$. Depending on $\epsilon_1 = +1$ or $\epsilon_1 = -1$, either $f^i$ or $f^j$ realizes the separation over $\epsilon$.

We conclude that $\mathrm{fat}_1(\mathcal{F}_i, \mathbf{x}) = d$ for at most one $i \in \{0, \ldots, k\}$. Without loss of generality, assume $\mathrm{fat}_1(\mathcal{F}_0, \mathbf{x}) \le d$ and $\mathrm{fat}_1(\mathcal{F}_i, \mathbf{x}) \le d-1$ for $i \in \{1, \ldots, k\}$. By induction, for any $\mathcal{F}_i$, $i \in \{1, \ldots, k\}$, there are 0-covers $V^\ell$ and $V^r$ of $\mathcal{F}_i$ on the subtrees $\mathbf{x}^\ell$ and $\mathbf{x}^r$, respectively, both of size at most $g_k(d-1,T-1)$. Out of these 0-covers we can create a 0-cover $V$ for $\mathcal{F}_i$ on $\mathbf{x}$ by pairing the 0-covers in $V^\ell$ and $V^r$. Formally, consider a set of pairs $(\mathbf{v}^\ell, \mathbf{v}^r)$ of trees, with $\mathbf{v}^\ell \in V^\ell$, $\mathbf{v}^r \in V^r$ and such that each tree in $V^\ell$ and $V^r$ appears in at least one of the pairs. Clearly, this can be done using at most $g_k(d-1,T-1)$ pairs, and such a pairing is not unique. We join the subtrees in every pair $(\mathbf{v}^\ell, \mathbf{v}^r)$ with a constant $i$ as the root, thus creating a set $V$ of trees, $|V| \le g_k(d-1,T-1)$. We claim that $V$ is a 0-cover for $\mathcal{F}_i$ on $\mathbf{x}$. Note that all the functions in $\mathcal{F}_i$ take on the same value $i$ on $x_1$ and by construction $\mathbf{v}_1 = i$ for any $\mathbf{v} \in V$. Now, consider any $f \in \mathcal{F}_i$ and $\epsilon \in \{\pm 1\}^T$. Without loss of generality, assume $\epsilon_1 = -1$. By assumption, there is a $\mathbf{v}^\ell \in V^\ell$ such that $\mathbf{v}^\ell_t(\epsilon_{2:T}) = f(\mathbf{x}_{t+1}(\epsilon_{1:T}))$ for any $t \in [T-1]$. By construction $\mathbf{v}^\ell$ appears as a left subtree of at least one tree in $V$, which, therefore, matches the values of $f$ for $\epsilon_{1:T}$. The same argument holds for $\epsilon_1 = +1$ by finding an appropriate subtree in $V^r$. We conclude that $V$ is a 0-cover of $\mathcal{F}_i$ on $\mathbf{x}$, and this holds for any $i \in \{1, \ldots, k\}$.

Therefore, the total size of a 0-cover for $\mathcal{F}_1 \cup \ldots \cup \mathcal{F}_k$ is at most $k\, g_k(d-1,T-1)$. A similar argument yields a 0-cover for $\mathcal{F}_0$ on $\mathbf{x}$ of size at most $g_k(d,T-1)$ by induction. Thus, the size of the resulting 0-cover of $\mathcal{F}$ on $\mathbf{x}$ is at most

$$g_k(d,T-1) + k\, g_k(d-1,T-1) = g_k(d,T),$$

completing the induction step and yielding the main statement of the theorem.

The upper bound on $g_k(d,T)$ appears in the proof of Theorem 5.

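□

Proof of Corollary. The first two inequalities follow by simple comparison of norms. It remains to prove the bound for the $\ell_\infty$ covering. For any $\alpha > 0$ define an $\alpha$-discretization of the $[-1,1]$ interval as $B_\alpha = \{-1 + \alpha/2, -1 + 3\alpha/2, \ldots, -1 + (2k+1)\alpha/2, \ldots\}$ for $0 \le k$ and $(2k+1)\alpha \le 4$. Also for any $a \in [-1,1]$ define $\lfloor a \rfloor_\alpha = \mathrm{argmin}_{r \in B_\alpha} |r - a|$ with ties being broken by choosing the smaller discretization point. For a function $f : \mathcal{X} \to [-1,1]$ let the function $\lfloor f \rfloor_\alpha$ be defined pointwise as $\lfloor f(x) \rfloor_\alpha$, and let $\lfloor \mathcal{F} \rfloor_\alpha = \{\lfloor f \rfloor_\alpha : f \in \mathcal{F}\}$. First, we prove that $\mathcal{N}_\infty(\alpha, \mathcal{F}, \mathbf{x}) \le \mathcal{N}_\infty(\alpha/2, \lfloor \mathcal{F} \rfloor_\alpha, \mathbf{x})$. Indeed, suppose the set of trees $V$ is a minimal $\alpha/2$-cover of $\lfloor \mathcal{F} \rfloor_\alpha$ on $\mathbf{x}$. That is,

$$\forall f_\alpha \in \lfloor \mathcal{F} \rfloor_\alpha,\ \forall \epsilon \in \{\pm 1\}^T\ \ \exists \mathbf{v} \in V \ \text{ s.t. } \ |\mathbf{v}_t(\epsilon) - f_\alpha(\mathbf{x}_t(\epsilon))| \le \alpha/2 .$$

Pick any $f \in \mathcal{F}$ and let $f_\alpha = \lfloor f \rfloor_\alpha$. Then $\|f - f_\alpha\|_\infty \le \alpha/2$. Then for all $\epsilon \in \{\pm 1\}^T$ and any $t \in [T]$,

$$|f(\mathbf{x}_t(\epsilon)) - \mathbf{v}_t(\epsilon)| \le |f(\mathbf{x}_t(\epsilon)) - f_\alpha(\mathbf{x}_t(\epsilon))| + |f_\alpha(\mathbf{x}_t(\epsilon)) - \mathbf{v}_t(\epsilon)| \le \alpha ,$$

and so $V$ also provides an $L_\infty$ cover at scale $\alpha$.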
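The discretization step can be illustrated concretely. The sketch below is an illustration only: the names `grid` and `round_to_grid` are hypothetical, and it assumes that $2/\alpha$ is an integer so that the grid $B_\alpha$ has spacing $\alpha$ across the whole interval. Exact rational arithmetic is used so that the tie-breaking rule is unambiguous.

```python
from fractions import Fraction

def grid(alpha):
    """B_alpha: the points -1 + (2k+1)*alpha/2 with k >= 0 and (2k+1)*alpha <= 4."""
    points, k = [], 0
    while (2 * k + 1) * alpha <= 4:
        points.append(-1 + (2 * k + 1) * alpha / 2)
        k += 1
    return points

def round_to_grid(a, alpha):
    """Nearest point of B_alpha to a, ties broken towards the smaller point."""
    return min(grid(alpha), key=lambda r: (abs(r - a), r))

alpha = Fraction(1, 2)                      # assumption: 2/alpha is an integer
samples = [Fraction(n, 100) for n in range(-100, 101)]
worst_error = max(abs(a - round_to_grid(a, alpha)) for a in samples)
tie_choice = round_to_grid(Fraction(0), alpha)
```

On this grid every sampled point moves by at most $\alpha/2$, which is the property $\|f - \lfloor f \rfloor_\alpha\|_\infty \le \alpha/2$ used in the triangle inequality.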

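We conclude that $\mathcal{N}_\infty(\alpha, \mathcal{F}, \mathbf{x}) \le \mathcal{N}_\infty(\alpha/2, \lfloor \mathcal{F} \rfloor_\alpha, \mathbf{x}) = \mathcal{N}_\infty(1/2, \mathcal{G}, \mathbf{x})$ where $\mathcal{G} = \frac{1}{\alpha} \lfloor \mathcal{F} \rfloor_\alpha$. The functions of $\mathcal{G}$ take on a discrete set of at most $\lfloor 2/\alpha \rfloor + 1$ values. Obviously, by adding a constant to all the functions in $\mathcal{G}$, we can make the set of values to be $\{0, \ldots, \lfloor 2/\alpha \rfloor\}$. We now apply Theorem 5 with the upper bound $\sum_{i=0}^{d} \binom{T}{i} k^i \le (ekT)^d$, which holds for any $T > 0$. This yields $\mathcal{N}_\infty(1/2, \mathcal{G}, \mathbf{x}) \le (2eT/\alpha)^{\mathrm{fat}_2(\mathcal{G})}$.

It remains to prove $\mathrm{fat}_2(\mathcal{G}) \le \mathrm{fat}_\alpha(\mathcal{F})$, or, equivalently (by scaling), $\mathrm{fat}_{2\alpha}(\lfloor \mathcal{F} \rfloor_\alpha) \le \mathrm{fat}_\alpha(\mathcal{F})$. To this end, suppose there exists an $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $d = \mathrm{fat}_{2\alpha}(\lfloor \mathcal{F} \rfloor_\alpha)$ such that there is a witness tree $\mathbf{s}$ with

$$\forall \epsilon \in \{\pm 1\}^d\ \ \exists f_\alpha \in \lfloor \mathcal{F} \rfloor_\alpha \ \text{ s.t. } \ \forall t \in [d],\ \ \epsilon_t (f_\alpha(\mathbf{x}_t(\epsilon)) - \mathbf{s}_t(\epsilon)) \ge \alpha .$$

Using the fact that for any $f \in \mathcal{F}$ and $f_\alpha = \lfloor f \rfloor_\alpha$ we have $\|f - f_\alpha\|_\infty \le \alpha/2$, it follows that

$$\forall \epsilon \in \{\pm 1\}^d\ \ \exists f \in \mathcal{F} \ \text{ s.t. } \ \forall t \in [d],\ \ \epsilon_t (f(\mathbf{x}_t(\epsilon)) - \mathbf{s}_t(\epsilon)) \ge \alpha/2 .$$

That is, $\mathbf{s}$ is a witness to $\alpha$-shattering by $\mathcal{F}$. Thus for any $\mathbf{x}$,

$$\mathcal{N}_\infty(\alpha, \mathcal{F}, \mathbf{x}) \le \mathcal{N}_\infty(\alpha/2, \lfloor \mathcal{F} \rfloor_\alpha, \mathbf{x}) \le \left(\frac{2eT}{\alpha}\right)^{\mathrm{fat}_{2\alpha}(\lfloor \mathcal{F} \rfloor_\alpha)} \le \left(\frac{2eT}{\alpha}\right)^{\mathrm{fat}_\alpha(\mathcal{F})} .$$

□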



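Proof of Theorem 9. Define $\beta_0 = 1$ and $\beta_j = 2^{-j}$. For a fixed tree $\mathbf{x}$ of depth $T$, let $V_j$ be an $\ell_2$-cover at scale $\beta_j$. For any path $\epsilon \in \{\pm 1\}^T$ and any $f \in \mathcal{F}$, let $\mathbf{v}[f,\epsilon]^j \in V_j$ be the element of the cover such that

$$\sqrt{\frac{1}{T} \sum_{t=1}^T \left|\mathbf{v}[f,\epsilon]^j_t(\epsilon) - f(\mathbf{x}_t(\epsilon))\right|^2} \le \beta_j .$$

By the definition such a $\mathbf{v}[f,\epsilon]^j \in V_j$ exists, and we assume for simplicity this element is unique (ties can be broken in an arbitrary manner). Thus, $f \mapsto \mathbf{v}[f,\epsilon]^j$ is a well-defined mapping for any fixed $\epsilon$ and $j$. As before, $\mathbf{v}[f,\epsilon]^j_t$ denotes the $t$-th mapping of $\mathbf{v}[f,\epsilon]^j$. For any $t \in [T]$, we have

$$f(\mathbf{x}_t(\epsilon)) = f(\mathbf{x}_t(\epsilon)) - \mathbf{v}[f,\epsilon]^N_t(\epsilon) + \sum_{j=1}^{N} \left(\mathbf{v}[f,\epsilon]^j_t(\epsilon) - \mathbf{v}[f,\epsilon]^{j-1}_t(\epsilon)\right)$$

where $\mathbf{v}[f,\epsilon]^0_t(\epsilon) = 0$. Hence,

$$\begin{aligned}
\mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right]
&= \mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \left( f(\mathbf{x}_t(\epsilon)) - \mathbf{v}[f,\epsilon]^N_t(\epsilon) + \sum_{j=1}^{N} \left(\mathbf{v}[f,\epsilon]^j_t(\epsilon) - \mathbf{v}[f,\epsilon]^{j-1}_t(\epsilon)\right) \right) \right] \\
&\le \mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \left( f(\mathbf{x}_t(\epsilon)) - \mathbf{v}[f,\epsilon]^N_t(\epsilon) \right) \right]
+ \mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \sum_{j=1}^{N} \left(\mathbf{v}[f,\epsilon]^j_t(\epsilon) - \mathbf{v}[f,\epsilon]^{j-1}_t(\epsilon)\right) \right] . \qquad (10)
\end{aligned}$$

The first term above can be bounded via the Cauchy-Schwarz inequality as

$$\mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \left( f(\mathbf{x}_t(\epsilon)) - \mathbf{v}[f,\epsilon]^N_t(\epsilon) \right) \right]
\le T\, \mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \sqrt{\frac{1}{T} \sum_{t=1}^T \left( f(\mathbf{x}_t(\epsilon)) - \mathbf{v}[f,\epsilon]^N_t(\epsilon) \right)^2} \right]
\le T \beta_N .$$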



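The second term in (10) is bounded by considering successive refinements of the cover. The argument, however, is more delicate than in the classical case, as the trees $\mathbf{v}[f,\epsilon]^j$, $\mathbf{v}[f,\epsilon]^{j-1}$ depend on the particular path. Consider all possible pairs of $\mathbf{v}^s \in V_j$ and $\mathbf{v}^r \in V_{j-1}$, for $1 \le s \le |V_j|$, $1 \le r \le |V_{j-1}|$, where we assumed an arbitrary enumeration of elements. For each pair $(\mathbf{v}^s, \mathbf{v}^r)$, define a real-valued tree $\mathbf{w}^{(s,r)}$ by

$$\mathbf{w}^{(s,r)}_t(\epsilon) = \begin{cases} \mathbf{v}^s_t(\epsilon) - \mathbf{v}^r_t(\epsilon) & \text{if there exists } f \in \mathcal{F} \text{ s.t. } \mathbf{v}^s = \mathbf{v}[f,\epsilon]^j,\ \mathbf{v}^r = \mathbf{v}[f,\epsilon]^{j-1} \\ 0 & \text{otherwise} \end{cases}$$

for all $t \in [T]$ and $\epsilon \in \{\pm 1\}^T$. It is crucial that $\mathbf{w}^{(s,r)}$ can be non-zero only on those paths $\epsilon$ for which $\mathbf{v}^s$ and $\mathbf{v}^r$ are indeed the members of the covers (at successive resolutions) close to $f(\mathbf{x}(\epsilon))$ (in the $\ell_2$ sense) for some $f \in \mathcal{F}$. It is easy to see that $\mathbf{w}^{(s,r)}$ is well-defined. Let the set of trees $W_j$ be defined as

$$W_j = \left\{ \mathbf{w}^{(s,r)} : 1 \le s \le |V_j|,\ 1 \le r \le |V_{j-1}| \right\} .$$

Now, the second term in (10) can be written as

$$\mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \sum_{j=1}^{N} \left(\mathbf{v}[f,\epsilon]^j_t(\epsilon) - \mathbf{v}[f,\epsilon]^{j-1}_t(\epsilon)\right) \right]
\le \sum_{j=1}^{N} \mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \left(\mathbf{v}[f,\epsilon]^j_t(\epsilon) - \mathbf{v}[f,\epsilon]^{j-1}_t(\epsilon)\right) \right]
\le \sum_{j=1}^{N} \mathbb{E}_\epsilon \left[ \max_{\mathbf{w} \in W_j} \sum_{t=1}^T \epsilon_t \mathbf{w}_t(\epsilon) \right] .$$

The last inequality holds because for any $j \in [N]$, $\epsilon \in \{\pm 1\}^T$ and $f \in \mathcal{F}$ there is some $\mathbf{w}^{(s,r)} \in W_j$ with $\mathbf{v}[f,\epsilon]^j = \mathbf{v}^s$, $\mathbf{v}[f,\epsilon]^{j-1} = \mathbf{v}^r$ and

$$\mathbf{v}^s_t(\epsilon) - \mathbf{v}^r_t(\epsilon) = \mathbf{w}^{(s,r)}_t(\epsilon) \quad \forall t \le T .$$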



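Clearly, $|W_j| \le |V_j| \cdot |V_{j-1}|$. To invoke the bound for a finite collection of trees, it remains to bound the magnitude of all $\mathbf{w}^{(s,r)} \in W_j$ along all paths. For this purpose, fix $\mathbf{w}^{(s,r)}$ and a path $\epsilon$. If there exists $f \in \mathcal{F}$ for which $\mathbf{v}^s = \mathbf{v}[f,\epsilon]^j$ and $\mathbf{v}^r = \mathbf{v}[f,\epsilon]^{j-1}$, then $\mathbf{w}^{(s,r)}_t(\epsilon) = \mathbf{v}[f,\epsilon]^j_t - \mathbf{v}[f,\epsilon]^{j-1}_t$ for any $t \in [T]$. By the triangle inequality,

$$\sqrt{\sum_{t=1}^T \mathbf{w}^{(s,r)}_t(\epsilon)^2}
\le \sqrt{\sum_{t=1}^T \left(\mathbf{v}[f,\epsilon]^j_t(\epsilon) - f(\mathbf{x}_t(\epsilon))\right)^2} + \sqrt{\sum_{t=1}^T \left(\mathbf{v}[f,\epsilon]^{j-1}_t(\epsilon) - f(\mathbf{x}_t(\epsilon))\right)^2}
\le \sqrt{T}(\beta_j + \beta_{j-1}) = 3\sqrt{T}\beta_j .$$

If there exists no such $f \in \mathcal{F}$ for the given $\epsilon$ and $(s,r)$, then $\mathbf{w}^{(s,r)}_t(\epsilon)$ is zero for all $t \ge t_0$, for some $1 \le t_0 < T$, and thus

$$\sqrt{\sum_{t=1}^T \mathbf{w}^{(s,r)}_t(\epsilon)^2} \le \sqrt{\sum_{t=1}^T \mathbf{w}^{(s,r)}_t(\epsilon')^2}$$

for any other path $\epsilon'$ which agrees with $\epsilon$ up to $t_0$. Hence, the bound

$$\sqrt{\sum_{t=1}^T \mathbf{w}^{(s,r)}_t(\epsilon)^2} \le 3\sqrt{T}\beta_j$$

holds for all $\epsilon \in \{\pm 1\}^T$ and all $\mathbf{w}^{(s,r)} \in W_j$.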

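Now, back to (10), we put everything together and apply the bound for finite collections of trees:

$$\begin{aligned}
\mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right]
&\le T\beta_N + \sqrt{T} \sum_{j=1}^{N} 3\beta_j \sqrt{2 \log(|V_j|\,|V_{j-1}|)} \\
&\le T\beta_N + \sqrt{T} \sum_{j=1}^{N} 6\beta_j \sqrt{\log |V_j|} \\
&\le T\beta_N + 12\sqrt{T} \sum_{j=1}^{N} (\beta_j - \beta_{j+1}) \sqrt{\log \mathcal{N}_2(\beta_j, \mathcal{F}, \mathbf{x})} \\
&\le T\beta_N + 12\sqrt{T} \int_{\beta_{N+1}}^{\beta_0} \sqrt{\log \mathcal{N}_2(\delta, \mathcal{F}, \mathbf{x})}\, d\delta
\end{aligned}$$

where the last but one step is because $2(\beta_j - \beta_{j+1}) = \beta_j$. Now for any $\alpha > 0$, pick $N = \sup\{j : \beta_j > 2\alpha\}$. In this case we see that by our choice of $N$, $\beta_{N+1} \le 2\alpha$ and so $\beta_N = 2\beta_{N+1} \le 4\alpha$. Also note that since $\beta_N > 2\alpha$, $\beta_{N+1} = \beta_N / 2 > \alpha$. Hence we conclude that

$$\mathfrak{R}_T(\mathcal{F}) \le \inf_{\alpha} \left\{ 4T\alpha + 12\sqrt{T} \int_{\alpha}^{1} \sqrt{\log \mathcal{N}_2(\delta, \mathcal{F}, T)}\, d\delta \right\} .$$

□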
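The passage from the dyadic sum to the integral in the proof above relies only on $\beta_j = 2(\beta_j - \beta_{j+1})$ and on the integrand being non-increasing. The sketch below checks this numerically for one hypothetical entropy profile, $h(\delta) = \sqrt{d \log(2eT/\delta)}$; the values of `d`, `T` and `N` and the helper names are assumptions chosen for illustration.

```python
import math

def dyadic_sum(h, N):
    """sum_{j=1}^{N} beta_j * h(beta_j) with beta_j = 2**-j."""
    return sum(2.0 ** -j * h(2.0 ** -j) for j in range(1, N + 1))

def integral(h, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(h(lo + (i + 0.5) * width) for i in range(steps))

# Hypothetical non-increasing entropy profile (illustrative values of d and T).
d, T = 3, 100

def h(delta):
    return math.sqrt(d * math.log(2 * math.e * T / delta))

N = 6
lhs = dyadic_sum(h, N)                              # sum_j beta_j * h(beta_j)
rhs = 2 * integral(h, 2.0 ** -(N + 1), 2.0 ** -1)   # 2 * integral over [beta_{N+1}, beta_1]
```

The comparison `lhs <= rhs` is the inequality used in the last step of the chain, restricted to the interval $[\beta_{N+1}, \beta_1]$; extending the upper limit to $\beta_0$ only enlarges the right-hand side.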



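Proof of Lemma 10. Consider some $\beta > 0$ for which $\mathrm{fat}_\beta \ge T$. This implies that there exists a tree $\mathbf{x}^*$ that is $\beta$-shattered (using a threshold tree, say $\mathbf{s}$) by the function class $\mathcal{F}$. Hence we have that

$$\mathfrak{R}_T(\mathcal{F}) \ge \mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\mathbf{x}^*_t(\epsilon)) \right]
= \mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \left( f(\mathbf{x}^*_t(\epsilon)) - \mathbf{s}_t(\epsilon) \right) \right]
\ge \mathbb{E}_\epsilon \left[ \sum_{t=1}^T \frac{\beta}{2} \right] = \frac{T\beta}{2}$$

where the step before the last uses the fact that the tree $\mathbf{x}^*$ is $\beta$-shattered by $\mathcal{F}$ with threshold tree $\mathbf{s}$. Hence we conclude that $\mathrm{fat}_\beta \ge T$ implies that $\beta \le \frac{2}{T} \mathfrak{R}_T(\mathcal{F})$. The contrapositive is the required statement. □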

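Proof of Lemma 11. By Corollary 6, we have that

$$\mathfrak{D}_T(\mathcal{F}) \le \inf_{\alpha} \left\{ 4T\alpha + 12\sqrt{T} \int_{\alpha}^{1} \sqrt{\mathrm{fat}_\beta \log(2eT/\beta)}\, d\beta \right\} .$$

Choosing $\alpha = \frac{2}{T} \mathfrak{R}_T(\mathcal{F})$,

$$\mathfrak{D}_T(\mathcal{F}) \le 8\, \mathfrak{R}_T(\mathcal{F}) + 12\sqrt{T} \int_{\frac{2}{T} \mathfrak{R}_T(\mathcal{F})}^{1} \sqrt{\mathrm{fat}_\beta \log(2eT/\beta)}\, d\beta .$$

Lemma 10 together with the lower bound of Proposition 16 (proved later in the Appendix) implies that for any $\beta > \frac{2}{T} \mathfrak{R}_T(\mathcal{F})$,

$$\mathrm{fat}_\beta \le \frac{32\, \mathfrak{R}_T(\mathcal{F})^2}{T \beta^2} .$$

The following inequality is useful for bounding the integral: For any $b > 1$ and $\alpha \in (0,1)$,

$$\int_{\alpha}^{1} \frac{1}{\beta} \sqrt{\log(b/\beta)}\, d\beta = \int_{b}^{b/\alpha} \frac{1}{x} \sqrt{\log x}\, dx = \frac{2}{3} \log^{3/2}(x) \Big|_{b}^{b/\alpha} \le \frac{2}{3} \log^{3/2}(b/\alpha) \qquad (11)$$

where we performed a change of variables with $x = b/\beta$.

Using Eq. (11),

$$\mathfrak{D}_T(\mathcal{F}) \le 8\, \mathfrak{R}_T(\mathcal{F}) + 48\sqrt{2}\, \mathfrak{R}_T(\mathcal{F}) \int_{\frac{2}{T} \mathfrak{R}_T(\mathcal{F})}^{1} \frac{1}{\beta} \sqrt{\log(2eT/\beta)}\, d\beta
\le 8\, \mathfrak{R}_T(\mathcal{F}) + 32\sqrt{2}\, \mathfrak{R}_T(\mathcal{F}) \log^{3/2}\left( \frac{eT^2}{\mathfrak{R}_T(\mathcal{F})} \right) .$$

Using the assumption $\mathfrak{R}_T(\mathcal{F}) \ge 1$ concludes the proof. We remark that the assumption is very mild and satisfied for any non-trivial class $\mathcal{F}$. □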
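The integral identity (11) used in the proof above can be sanity-checked numerically. The following is a rough sketch: the values of `a` and `b` are arbitrary illustrative choices (with `a` playing the role of $\alpha$), and the midpoint rule stands in for exact integration.

```python
import math

def lhs_integral(a, b, steps=200_000):
    """Midpoint rule for the integral over [a, 1] of sqrt(log(b / beta)) / beta."""
    width = (1.0 - a) / steps
    total = 0.0
    for i in range(steps):
        beta = a + (i + 0.5) * width
        total += math.sqrt(math.log(b / beta)) / beta
    return total * width

def closed_form(a, b):
    """(2/3) * (log^{3/2}(b/a) - log^{3/2}(b)), from the substitution x = b / beta."""
    return (2.0 / 3.0) * (math.log(b / a) ** 1.5 - math.log(b) ** 1.5)

a, b = 0.01, 2 * math.e * 100      # illustrative values with b > 1 and 0 < a < 1
numeric = lhs_integral(a, b)
exact = closed_form(a, b)
upper = (2.0 / 3.0) * math.log(b / a) ** 1.5
```

The numerical value agrees with the closed form, and dropping the non-negative term $\frac{2}{3}\log^{3/2}(b)$ gives the upper bound stated in (11).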

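Proof of Theorem 12. Combining Lemma 13 and Lemma 14, for any $D$,

$$\mathbb{P}_D \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \left( f(x_t) - \mathbb{E}_{t-1}[f(x_t)] \right) \right| > \alpha \right) \le 8 \left( \frac{16eT}{\alpha} \right)^{\mathrm{fat}_{\alpha/8}} e^{-T\alpha^2/128} .$$

Applying the Borel-Cantelli lemma proves the required result as

$$\sum_{T=1}^{\infty} 8 \left( \frac{16eT}{\alpha} \right)^{\mathrm{fat}_{\alpha/8}} e^{-T\alpha^2/128} < \infty .$$

□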



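Proof of Lemma 13. Let $(x'_1, \ldots, x'_T)$ be a sequence tangent to $(x_1, \ldots, x_T)$. Recall the notation $\mathbb{E}_{t-1}[f(x'_t)] = \mathbb{E}\{f(x'_t) \mid x_1, \ldots, x_{t-1}\}$. By Chebychev's inequality, for any $f \in \mathcal{F}$,

$$\begin{aligned}
\mathbb{P}_D \left( \frac{1}{T} \left| \sum_{t=1}^T \left( f(x'_t) - \mathbb{E}_{t-1}[f(x'_t)] \right) \right| > \alpha/2 \ \Big|\ x_1, \ldots, x_T \right)
&\le \frac{\mathbb{E} \left[ \left( \sum_{t=1}^T \left( f(x'_t) - \mathbb{E}_{t-1}[f(x'_t)] \right) \right)^2 \Big|\ x_1, \ldots, x_T \right]}{T^2 \alpha^2 / 4} \\
&= \frac{\sum_{t=1}^T \mathbb{E} \left[ \left( f(x'_t) - \mathbb{E}_{t-1}[f(x'_t)] \right)^2 \Big|\ x_1, \ldots, x_T \right]}{T^2 \alpha^2 / 4} \\
&\le \frac{4T}{T^2 \alpha^2 / 4} = \frac{16}{T \alpha^2} .
\end{aligned}$$

The second step is due to the fact that the cross terms are zero: for $s \ne t$,

$$\mathbb{E} \left\{ \left( f(x'_t) - \mathbb{E}_{t-1}[f(x'_t)] \right) \left( f(x'_s) - \mathbb{E}_{s-1}[f(x'_s)] \right) \Big|\ x_1, \ldots, x_T \right\} = 0 .$$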



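Hence

$$\inf_{f \in \mathcal{F}} \mathbb{P}_D \left( \frac{1}{T} \left| \sum_{t=1}^T \left( f(x'_t) - \mathbb{E}_{t-1}[f(x'_t)] \right) \right| \le \alpha/2 \ \Big|\ x_1, \ldots, x_T \right) \ge 1 - \frac{16}{T\alpha^2} .$$

Whenever $\alpha^2 \ge 32/T$ we can conclude that

$$\inf_{f \in \mathcal{F}} \mathbb{P}_D \left( \frac{1}{T} \left| \sum_{t=1}^T \left( f(x'_t) - \mathbb{E}_{t-1}[f(x'_t)] \right) \right| \le \alpha/2 \ \Big|\ x_1, \ldots, x_T \right) \ge \frac{1}{2} .$$

Now given a fixed $x_1, \ldots, x_T$ let $f^*$ be the function that maximizes $\frac{1}{T} \left| \sum_{t=1}^T \left( f(x_t) - \mathbb{E}_{t-1}[f(x'_t)] \right) \right|$. Note that $f^*$ is a deterministic choice given $x_1, \ldots, x_T$. Hence

$$\frac{1}{2} \le \inf_{f \in \mathcal{F}} \mathbb{P}_D \left( \frac{1}{T} \left| \sum_{t=1}^T \left( f(x'_t) - \mathbb{E}_{t-1}[f(x'_t)] \right) \right| \le \alpha/2 \ \Big|\ x_1, \ldots, x_T \right)
\le \mathbb{P}_D \left( \frac{1}{T} \left| \sum_{t=1}^T \left( f^*(x'_t) - \mathbb{E}_{t-1}[f^*(x'_t)] \right) \right| \le \alpha/2 \ \Big|\ x_1, \ldots, x_T \right) .$$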


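Let $A = \left\{ (x_1, \ldots, x_T) : \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T f(x_t) - \mathbb{E}_{t-1}[f(x'_t)] \right| > \alpha \right\}$. Since the above inequality holds for any $x_1, \ldots, x_T$ we can assert that

$$\frac{1}{2} \le \mathbb{P}_D \left( \frac{1}{T} \left| \sum_{t=1}^T \left( f^*(x'_t) - \mathbb{E}_{t-1}[f^*(x'_t)] \right) \right| \le \alpha/2 \ \Big|\ (x_1, \ldots, x_T) \in A \right) .$$

This, in turn, implies that

$$\frac{1}{2}\, \mathbb{P}_D \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \left( f(x_t) - \mathbb{E}_{t-1}[f(x'_t)] \right) \right| > \alpha \right)
\le \mathbb{P}_D \left( \frac{1}{T} \left| \sum_{t=1}^T \left( f^*(x'_t) - \mathbb{E}_{t-1}[f^*(x'_t)] \right) \right| \le \alpha/2 \ \Big|\ (x_1, \ldots, x_T) \in A \right)
\times \mathbb{P}_D \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \left( f(x_t) - \mathbb{E}_{t-1}[f(x'_t)] \right) \right| > \alpha \right) .$$

The latter product can be upper bounded by

$$\mathbb{P}_D \left( \frac{1}{T} \left| \sum_{t=1}^T \left( f^*(x_t) - f^*(x'_t) \right) \right| > \alpha/2 \right)
\le \mathbb{P}_D \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \left( f(x_t) - f(x'_t) \right) \right| > \alpha/2 \right) .$$

Now we apply Lemma 3 with $\phi(u) = \mathbf{1}\{u > \alpha/2\}$:

$$\mathbb{E} \left[ \mathbf{1} \left\{ \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \left( f(x_t) - f(x'_t) \right) \right| > \alpha/2 \right\} \right]
\le \sup_{x_1, x'_1} \mathbb{E}_{\epsilon_1} \ldots \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left[ \mathbf{1} \left\{ \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t \left( f(x'_t) - f(x_t) \right) \right| > \alpha/2 \right\} \right] . \qquad (12)$$

The next few steps are similar to the proof of Theorem 2. Since

$$\sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t \left( f(x'_t) - f(x_t) \right) \right| \le \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t f(x'_t) \right| + \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t f(x_t) \right| ,$$

it is true that

$$\mathbf{1} \left\{ \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t \left( f(x'_t) - f(x_t) \right) \right| > \alpha/2 \right\}
\le \mathbf{1} \left\{ \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t f(x'_t) \right| > \alpha/4 \right\} + \mathbf{1} \left\{ \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t f(x_t) \right| > \alpha/4 \right\} .$$

The right-hand side of Eq. (12) then splits into two equal parts:

$$\sup_{x_1} \mathbb{E}_{\epsilon_1} \ldots \sup_{x_T} \mathbb{E}_{\epsilon_T} \mathbf{1} \left\{ \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t f(x_t) \right| > \alpha/4 \right\}
+ \sup_{x'_1} \mathbb{E}_{\epsilon_1} \ldots \sup_{x'_T} \mathbb{E}_{\epsilon_T} \mathbf{1} \left\{ \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t f(x'_t) \right| > \alpha/4 \right\}
= 2 \sup_{x_1} \mathbb{E}_{\epsilon_1} \ldots \sup_{x_T} \mathbb{E}_{\epsilon_T} \mathbf{1} \left\{ \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t f(x_t) \right| > \alpha/4 \right\} .$$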

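Moving to the tree representation,

$$\mathbb{P}_D \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \left( f(x_t) - f(x'_t) \right) \right| > \alpha/2 \right)
\le 2 \sup_{\mathbf{x}} \mathbb{E}_\epsilon \left[ \mathbf{1} \left\{ \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right| > \alpha/4 \right\} \right]
= 2 \sup_{\mathbf{x}} \mathbb{P}_\epsilon \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right| > \alpha/4 \right) .$$

We can now conclude that

$$\mathbb{P}_D \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \left( f(x_t) - \mathbb{E}_{t-1}[f(x'_t)] \right) \right| > \alpha \right)
\le 4 \sup_{\mathbf{x}} \mathbb{P}_\epsilon \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right| > \alpha/4 \right) .$$

□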



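Proof of Lemma 14. Fix an $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $T$. By assumption $\mathrm{fat}_\alpha(\mathcal{F}) < \infty$ for any $\alpha > 0$. Let $V$ be a minimum $\ell_1$-cover of $\mathcal{F}$ over $\mathbf{x}$ at scale $\alpha/8$. Corollary 6 ensures that

$$|V| = \mathcal{N}_1(\alpha/8, \mathcal{F}, \mathbf{x}) \le \left( \frac{16eT}{\alpha} \right)^{\mathrm{fat}_{\alpha/8}}$$

and for any $f \in \mathcal{F}$ and $\epsilon \in \{\pm 1\}^T$, there exists $\mathbf{v}[f,\epsilon] \in V$ such that

$$\frac{1}{T} \sum_{t=1}^T \left| f(\mathbf{x}_t(\epsilon)) - \mathbf{v}[f,\epsilon]_t(\epsilon) \right| \le \alpha/8$$

on the given path $\epsilon$. Hence

$$\begin{aligned}
\mathbb{P}_\epsilon \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right| > \alpha/4 \right)
&= \mathbb{P}_\epsilon \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t \left( f(\mathbf{x}_t(\epsilon)) - \mathbf{v}[f,\epsilon]_t(\epsilon) + \mathbf{v}[f,\epsilon]_t(\epsilon) \right) \right| > \alpha/4 \right) \\
&\le \mathbb{P}_\epsilon \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t \left( f(\mathbf{x}_t(\epsilon)) - \mathbf{v}[f,\epsilon]_t(\epsilon) \right) \right| + \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t \mathbf{v}[f,\epsilon]_t(\epsilon) \right| > \alpha/4 \right) \\
&\le \mathbb{P}_\epsilon \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t \mathbf{v}[f,\epsilon]_t(\epsilon) \right| > \alpha/8 \right) .
\end{aligned}$$

For fixed $\epsilon = (\epsilon_1, \ldots, \epsilon_T)$,

$$\frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t \mathbf{v}[f,\epsilon]_t(\epsilon) \right| > \alpha/8 \quad \Longrightarrow \quad \frac{1}{T} \max_{\mathbf{v} \in V} \left| \sum_{t=1}^T \epsilon_t \mathbf{v}_t(\epsilon) \right| > \alpha/8$$

and, therefore, for any $\mathbf{x}$,

$$\begin{aligned}
\mathbb{P}_\epsilon \left( \frac{1}{T} \sup_{f \in \mathcal{F}} \left| \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right| > \alpha/4 \right)
&\le \mathbb{P}_\epsilon \left( \frac{1}{T} \max_{\mathbf{v} \in V} \left| \sum_{t=1}^T \epsilon_t \mathbf{v}_t(\epsilon) \right| > \alpha/8 \right)
\le \sum_{\mathbf{v} \in V} \mathbb{P}_\epsilon \left( \frac{1}{T} \left| \sum_{t=1}^T \epsilon_t \mathbf{v}_t(\epsilon) \right| > \alpha/8 \right) \\
&\le 2|V| e^{-T\alpha^2/128} \le 2 \left( \frac{16eT}{\alpha} \right)^{\mathrm{fat}_{\alpha/8}} e^{-T\alpha^2/128} .
\end{aligned}$$

□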
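The exponential factor $2e^{-T\alpha^2/128}$ in the proof above is a tail bound for a single tree $\mathbf{v}$ with entries in $[-1,1]$. As a hedged illustration, the sketch below takes the simplest case, the constant tree $\mathbf{v}_t \equiv 1$ (an assumption made so that the tail probability can be computed exactly from binomial coefficients), and compares the exact tail with the bound. The function names are illustrative.

```python
import math

def exact_tail(T, threshold):
    """P(|eps_1 + ... + eps_T| > threshold) for i.i.d. uniform signs eps_t."""
    count = sum(
        math.comb(T, heads)
        for heads in range(T + 1)
        if abs(2 * heads - T) > threshold   # the sum equals heads - (T - heads)
    )
    return count / 2 ** T

def tail_bound(T, alpha):
    """2 * exp(-T * alpha**2 / 128), bounding P(|sum| / T > alpha / 8)."""
    return 2 * math.exp(-T * alpha ** 2 / 128)

checks = [
    (T, alpha, exact_tail(T, T * alpha / 8), tail_bound(T, alpha))
    for T in (100, 500, 2000)
    for alpha in (0.5, 1.0)
]
```

For small $T\alpha^2$ the bound exceeds 1 and is vacuous; it becomes informative once $T\alpha^2$ is large compared to 128, which is the regime relevant for the Borel-Cantelli argument.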



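Proof of Lemma 15. We first prove the second inequality in Lemma 15. The proof closely follows that of Theorem 9. Define $\beta_0 = 1$ and $\beta_j = 2^{-j}$. For a fixed tree $\mathbf{x}$ of depth $T$, let $V_j$ be an $\ell_\infty$-cover at scale $\beta_j$. For any path $\epsilon \in \{\pm 1\}^T$ and any $f \in \mathcal{F}$, let $\mathbf{v}[f,\epsilon]^j \in V_j$ be a $\beta_j$-close element of the cover in the $\ell_\infty$ sense. Now, for any $f \in \mathcal{F}$,

$$\begin{aligned}
\left| \frac{1}{T} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right|
&\le \left| \frac{1}{T} \sum_{t=1}^T \epsilon_t \left( f(\mathbf{x}_t(\epsilon)) - \mathbf{v}[f,\epsilon]^N_t \right) \right| + \sum_{j=1}^{N} \left| \frac{1}{T} \sum_{t=1}^T \epsilon_t \left( \mathbf{v}[f,\epsilon]^j_t - \mathbf{v}[f,\epsilon]^{j-1}_t \right) \right| \\
&\le \max_{t \in [T]} \left| f(\mathbf{x}_t(\epsilon)) - \mathbf{v}[f,\epsilon]^N_t \right| + \sum_{j=1}^{N} \left| \frac{1}{T} \sum_{t=1}^T \epsilon_t \left( \mathbf{v}[f,\epsilon]^j_t - \mathbf{v}[f,\epsilon]^{j-1}_t \right) \right| .
\end{aligned}$$

Thus,

$$\sup_{f \in \mathcal{F}} \left| \frac{1}{T} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right| \le \beta_N + \sup_{f \in \mathcal{F}} \left\{ \sum_{j=1}^{N} \left| \frac{1}{T} \sum_{t=1}^T \epsilon_t \left( \mathbf{v}[f,\epsilon]^j_t - \mathbf{v}[f,\epsilon]^{j-1}_t \right) \right| \right\} .$$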



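We now proceed to upper bound the second term. Consider all possible pairs of $\mathbf{v}^s \in V_j$ and $\mathbf{v}^r \in V_{j-1}$, for $1 \le s \le |V_j|$, $1 \le r \le |V_{j-1}|$, where we assumed an arbitrary enumeration of elements. For each pair $(\mathbf{v}^s, \mathbf{v}^r)$, define a real-valued tree $\mathbf{w}^{(s,r)}$ by

$$\mathbf{w}^{(s,r)}_t(\epsilon) = \begin{cases} \mathbf{v}^s_t(\epsilon) - \mathbf{v}^r_t(\epsilon) & \text{if there exists } f \in \mathcal{F} \text{ s.t. } \mathbf{v}^s = \mathbf{v}[f,\epsilon]^j,\ \mathbf{v}^r = \mathbf{v}[f,\epsilon]^{j-1} \\ 0 & \text{otherwise} \end{cases}$$

for all $t \in [T]$ and $\epsilon \in \{\pm 1\}^T$. It is crucial that $\mathbf{w}^{(s,r)}$ can be non-zero only on those paths $\epsilon$ for which $\mathbf{v}^s$ and $\mathbf{v}^r$ are indeed the members of the covers (at successive resolutions) close in the $\ell_\infty$ sense to some $f \in \mathcal{F}$. It is easy to see that $\mathbf{w}^{(s,r)}$ is well-defined. Let the set of trees $W_j$ be defined as

$$W_j = \left\{ \mathbf{w}^{(s,r)} : 1 \le s \le |V_j|,\ 1 \le r \le |V_{j-1}| \right\} .$$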



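Using the above notation we see that

$$\sup_{f \in \mathcal{F}} \left| \frac{1}{T} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right| \le \beta_N + \sum_{j=1}^{N} \sup_{\mathbf{w} \in W_j} \left| \frac{1}{T} \sum_{t=1}^T \epsilon_t \mathbf{w}_t(\epsilon) \right| . \qquad (13)$$

Similarly to the proof of Theorem 9, we can show that $\max_{t \in [T]} |\mathbf{w}_t(\epsilon)| \le 3\beta_j$ for any $\mathbf{w} \in W_j$ and any path $\epsilon$.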

In the remainder of the proof we will use the shorthand $\mathcal{N}_\infty(\beta) = \mathcal{N}_\infty(\beta, \mathcal{F}, T)$. By the Azuma-Hoeffding inequality for real-valued martingales,



Hence by union bound we have, 



f sup ;^E^*^t(^) >^/?,VlogAU^^) <2A/'oo (/?,)' exp 



and so 



3j G [N], sup 



> ^/3,YlogAAoo(/3,)j < 2EA/'oo(/3j)' exp <j - 



Hence clearly 

TV 



E 



1 ^ ^ / \ ^ r 

-E^tw^w >eE/^jV^°g-^-('^^) <2E-^-(/5j)' - 

t=i j=i y j=i I- 



Using the above with Equation ( 13 1 gives us that 



(T N \ TV , 

sup - J2^tf{Me)) > pN+eY^P.^logUooiP,) <2EA/'oo (/?,)' exp - 

< 2Eexp|logAAoo(/3, ) (2 
7=1 ^ 



TOHogNoom 



TfP 



Since we assume that 2 < ^j-, the right-hand side of the last inequality is bounded above by 



2Eexp|- 

.7=1 



< 2Eexp -— - \ogM^{fi,) < 2e- V Y^^^il^,)-' 
J 1=1 .7=1 
If MooiPj) < log( J-)"'^"'''^ then the sum is finite and we are done. So assume that $\sum_{j=1}^{N} \mathcal{N}_\infty(\beta_j)^{-1} \le L$ for
some appropriate constant L. Hence we see that 



sup 



N 



>Pn + 0J2^j\/ log-^oo )] <Le- 



Now picking N appropriately and bounding sum by integral we have that 

N 



Hence we conclude that 




^^6,/(x,(e)) 



> inf |4a + 1261 j ^/\ogNoo{5,T , T)d(5| j < Le' 



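This proves the second inequality of Lemma 15. To prove the first inequality, we note that by a proof identical to that of Lemma 11 (with an extra factor of $\theta$ and normalization by $T$),

$$\inf_{\alpha} \left\{ 4\alpha + 12\theta \int_{\alpha}^{1} \sqrt{\log \mathcal{N}_\infty(\delta, \mathcal{F}, T)}\, d\delta \right\} \le \frac{\mathfrak{R}_T(\mathcal{F})}{T} \cdot 8 \left( 1 + 4\sqrt{2}\, \theta \sqrt{T} \log^{3/2}(eT^2) \right) .$$

Over-bounding the constants, we conclude the proof.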



□ 



Proof of Proposition 16. For the upper bound, we start by using Theorem 2 to bound the value of the game by the sequential Rademacher complexity of the loss class $\mathcal{F}_S$. Using the Lipschitz composition lemma (Lemma 23 with $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and $\phi(t, (x,y)) = |t - y|$), we have $\mathfrak{R}_T(\mathcal{F}_S) \le \mathfrak{R}_T(\mathcal{F})$. This is because $|t - y|$ is 1-Lipschitz in $t$ for any $y$. Hence, $\mathcal{V}^S_T(\mathcal{F}) \le 2\mathfrak{R}_T(\mathcal{F})$. We combine Theorem 9 and Corollary 6 to obtain the upper bound.

For the lower bound, we use a construction similar to [8]. We construct a particular distribution which induces a lower bound on regret for any algorithm. For any $\alpha > 0$, by definition of fat-shattering dimension, there exists a tree $\mathbf{x}$ of depth $d = \mathrm{fat}_\alpha(\mathcal{F})$ that can be $\alpha$-shattered by $\mathcal{F}$. For simplicity, we assume $T = kd$ where $k$ is some non-negative integer, and the case $T < d$ is discussed at the end of the proof. Now, define the $j$th block of time $T_j = \{(j-1)k + 1, \ldots, jk\}$.

Now the strategy of Nature (Adversary) is to first pick $\tilde{\epsilon} \in \{\pm 1\}^T$ independently and uniformly at random. Further let $\epsilon \in \{\pm 1\}^d$ be defined as $\epsilon_j = \mathrm{sign}\left( \sum_{t \in T_j} \tilde{\epsilon}_t \right)$ for $1 \le j \le d$, the block-wise modal sign of $\tilde{\epsilon}$. Now note that by definition of $\alpha$-shattering, there exists a witness tree $\mathbf{s}$ such that for any $\epsilon \in \{\pm 1\}^d$ there exists $f_\epsilon \in \mathcal{F}$ with $\epsilon_j (f_\epsilon(\mathbf{x}_j(\epsilon)) - \mathbf{s}_j(\epsilon)) \ge \alpha/2$ for all $1 \le j \le d$. Now let the random sequence $(x_1, y_1), \ldots, (x_T, y_T)$ be defined by $x_t = \mathbf{x}_j(\epsilon)$ for all $t \in T_j$ and $j \in \{1, \ldots, d\}$, and $y_t = \tilde{\epsilon}_t$. In the remainder of the proof we show that any algorithm suffers large expected regret.

Now consider any player strategy (possibly randomized) making prediction $\hat{y}_t \in [-1,1]$ at round $t$. Note that if we consider block $j$, $y_t = \tilde{\epsilon}_t$ is $\pm 1$ uniformly at random. This means that irrespective of what $\hat{y}_t$ the player plays, the expectation over $\tilde{\epsilon}_t$ of the loss the player suffers at round $t$ is

$$\mathbb{E}_{\tilde{\epsilon}_t} |\hat{y}_t - y_t| = \tfrac{1}{2} |\hat{y}_t - 1| + \tfrac{1}{2} |\hat{y}_t + 1| = 1 .$$

Hence on block $j$, the expected loss accumulated by any player is $k$ and so for any player strategy (possibly randomized),

$$\mathbb{E} \left[ \sum_{t=1}^T |\hat{y}_t - y_t| \right] = \sum_{j=1}^{d} k = dk = T . \qquad (14)$$



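On the other hand, since $x_t = \mathbf{x}_j(\epsilon)$, we know that there always exists a function for any $\epsilon \in \{\pm 1\}^d$, say $f_\epsilon$, such that $\epsilon_j (f_\epsilon(\mathbf{x}_j(\epsilon)) - \mathbf{s}_j(\epsilon)) \ge \alpha/2$. Hence

$$\mathbb{E} \left[ \inf_{f \in \mathcal{F}} \sum_{t=1}^T |f(x_t) - y_t| \right]
\le \mathbb{E} \left[ \sum_{j=1}^{d} \sum_{t \in T_j} |f_\epsilon(x_t) - y_t| \right]
= \mathbb{E} \left[ \sum_{j=1}^{d} \sum_{t \in T_j} |f_\epsilon(\mathbf{x}_j(\epsilon)) - y_t| \right]
\le \mathbb{E} \left[ \sum_{j=1}^{d} \max_{c_j \in [\mathbf{s}_j(\epsilon) + \epsilon_j \frac{\alpha}{2},\, \epsilon_j]} \sum_{t \in T_j} |c_j - y_t| \right]$$

where the last step is because for all of block $j$, $f_\epsilon(\mathbf{x}_j(\epsilon))$ does not depend on $t$ and lies in the interval $[\mathbf{s}_j(\epsilon) + \epsilon_j \frac{\alpha}{2}, \epsilon_j]$ (i.e. the majority side) and so by replacing it by the maximal $c_j$ in the same interval for that block we only make the quantity bigger. Now for a block $j$, define the number of labels that match the sign of $\epsilon_j$ (the majority) as $M_j = \sum_{t \in T_j} \mathbf{1}\{y_t = \epsilon_j\}$. Since $y_t = \tilde{\epsilon}_t \in \{\pm 1\}$, observe that the function $g(c_j) = \sum_{t \in T_j} |c_j - y_t|$ is linear on the interval $[-1,1]$ with its minimum at the majority sign $\epsilon_j$. Hence, the maximum over $[\mathbf{s}_j(\epsilon) + \epsilon_j \frac{\alpha}{2}, \epsilon_j]$ must occur at $c_j = \mathbf{s}_j(\epsilon) + \epsilon_j \frac{\alpha}{2}$. Substituting,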



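$$\begin{aligned}
\max_{c_j \in [\mathbf{s}_j(\epsilon) + \epsilon_j \frac{\alpha}{2},\, \epsilon_j]} \sum_{t \in T_j} |c_j - y_t|
&= M_j \left| \epsilon_j \mathbf{s}_j(\epsilon) + \frac{\alpha}{2} - 1 \right| + (k - M_j) \left| \epsilon_j \mathbf{s}_j(\epsilon) + \frac{\alpha}{2} + 1 \right| \\
&= M_j \left( 1 - \epsilon_j \mathbf{s}_j(\epsilon) - \frac{\alpha}{2} \right) + (k - M_j) \left( 1 + \epsilon_j \mathbf{s}_j(\epsilon) + \frac{\alpha}{2} \right) \\
&= k + (k - 2M_j) \left( \epsilon_j \mathbf{s}_j(\epsilon) + \frac{\alpha}{2} \right) .
\end{aligned}$$

Hence,

$$\mathbb{E} \left[ \inf_{f \in \mathcal{F}} \sum_{t=1}^T |f(x_t) - y_t| \right]
\le dk + \sum_{j=1}^{d} \mathbb{E} \left[ \epsilon_j \mathbf{s}_j(\epsilon)(k - 2M_j) + \frac{\alpha}{2}(k - 2M_j) \right]
= dk + \sum_{j=1}^{d} \mathbb{E} \left[ \epsilon_j \mathbf{s}_j(\epsilon)(k - 2M_j) \right] + \frac{\alpha}{2} \sum_{j=1}^{d} \mathbb{E} \left[ k - 2M_j \right] .$$

Further note that $k - 2M_j = -\left| \sum_{t \in T_j} \tilde{\epsilon}_t \right|$ and so $\epsilon_j (k - 2M_j) = -\sum_{t \in T_j} \tilde{\epsilon}_t$, and so the expectation

$$\mathbb{E} \left[ \epsilon_j \mathbf{s}_j(\epsilon)(k - 2M_j) \right] = \mathbb{E} \left[ \mathbb{E}_{\tilde{\epsilon}_t : t \in T_j} \left[ \mathbf{s}_j(\epsilon)\, \epsilon_j (k - 2M_j) \right] \right] = 0$$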


-"^We use the convention that [a, b] stands for [b, a] whenever a > b. 






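because $\mathbf{s}_j(\epsilon)$ is independent of $\tilde{\epsilon}_t$ for $t \in T_j$. Hence we see that

$$\mathbb{E} \left[ \inf_{f \in \mathcal{F}} \sum_{t=1}^T |f(x_t) - y_t| \right] \le dk + \frac{\alpha}{2} \sum_{j=1}^{d} \mathbb{E} \left[ k - 2M_j \right] . \qquad (15)$$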



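Combining Equations (14) and (15) we can conclude that for any player strategy,

$$\mathbb{E} \left[ \sum_{t=1}^T |\hat{y}_t - y_t| \right] - \mathbb{E} \left[ \inf_{f \in \mathcal{F}} \sum_{t=1}^T |f(x_t) - y_t| \right]
\ge \frac{\alpha}{2} \sum_{j=1}^{d} \mathbb{E} \left[ 2M_j - k \right]
= \frac{\alpha}{2} \sum_{j=1}^{d} \mathbb{E} \left| \sum_{t \in T_j} \tilde{\epsilon}_t \right|
\ge \frac{\alpha d}{2} \sqrt{\frac{k}{2}} = \alpha \sqrt{\frac{T\, \mathrm{fat}_\alpha}{8}}$$

by Khinchine's inequality (e.g. [11, Lemma A.9]), yielding the theorem statement for $T \ge \mathrm{fat}_\alpha$. For the case of $T < \mathrm{fat}_\alpha$, the proof is the same with $k = 1$ and the depth of the shattered tree being $T$, yielding a lower bound of $\alpha T / \sqrt{8}$. This completes the proof.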
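The Khinchine step $\mathbb{E}\left|\sum_{t \in T_j} \tilde{\epsilon}_t\right| \ge \sqrt{k/2}$ can be checked exactly for small block lengths. The sketch below is illustrative only; the function names are hypothetical. It computes the expectation from binomial coefficients and evaluates the resulting regret lower bound $\frac{\alpha}{2} d\, \mathbb{E}\left|\sum_{t=1}^k \tilde{\epsilon}_t\right|$.

```python
import math

def expected_abs_sum(k):
    """E|eps_1 + ... + eps_k| for i.i.d. uniform signs, computed exactly."""
    total = sum(math.comb(k, heads) * abs(2 * heads - k) for heads in range(k + 1))
    return total / 2 ** k

def regret_lower_bound(alpha, d, k):
    """(alpha / 2) * d * E|sum of k signs|, the quantity bounded via Khinchine."""
    return alpha / 2 * d * expected_abs_sum(k)

khinchine_ok = all(
    expected_abs_sum(k) >= math.sqrt(k / 2) - 1e-12 for k in range(1, 60)
)
# Illustrative values: alpha = 0.5, d = 4 blocks of length k = 25, so T = 100.
bound = regret_lower_bound(0.5, 4, 25)
target = 0.5 * math.sqrt(100 * 4 / 8)     # alpha * sqrt(T * d / 8)
```

The case $k = 2$ attains equality in Khinchine's inequality, which is why the check allows a small floating-point tolerance.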

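For the second statement, observe that a lower bound on the value can be obtained by choosing any particular joint distribution on sequences $(x_1, y_1), \ldots, (x_T, y_T)$ in Eq. (3):

$$\mathcal{V}^S_T(\mathcal{F}) \ge \mathbb{E} \left[ \sum_{t=1}^T \inf_{f_t \in \mathcal{F}} \mathbb{E}_{(x_t, y_t)} \left[ |y_t - f_t(x_t)| \ \Big|\ (x,y)_{1:t-1} \right] - \inf_{f \in \mathcal{F}} \sum_{t=1}^T |y_t - f(x_t)| \right] .$$

To this end, choose any $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $T$. Let $y_1, \ldots, y_T$ be i.i.d. Rademacher random variables and define $x_t = \mathbf{x}_t(y_{1:t-1})$ deterministically (that is, the conditional distribution of $x_t$ is a point distribution on $\mathbf{x}_t(y_{1:t-1})$). It is easy to see that this distribution makes the choice of $f_t$ irrelevant, yielding

$$\mathcal{V}^S_T(\mathcal{F}) \ge \mathbb{E} \left[ \sum_{t=1}^T 1 - \inf_{f \in \mathcal{F}} \sum_{t=1}^T |y_t - f(x_t)| \right]
= \mathbb{E}_{y_1, \ldots, y_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T y_t f(\mathbf{x}_t(y)) \right] .$$

Since this holds for any tree $\mathbf{x}$, we obtain the desired lower bound $\mathcal{V}^S_T(\mathcal{F}) \ge \mathfrak{R}_T(\mathcal{F})$. We remark that the lower bound of the first part of the Proposition could be directly proven as a lower bound on $\mathfrak{R}_T(\mathcal{F})$. □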
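The equality between the regret expression and the Rademacher-type average in the second statement rests on a one-line identity, spelled out here as a worked step (a sketch of the reasoning under the stated ranges, not a quotation from the source). For $y \in \{\pm 1\}$ and $a \in [-1,1]$ the sign of $y - a$ is that of $y$, so $|y - a| = y(y - a) = 1 - ya$, and therefore

```latex
\sum_{t=1}^{T} 1 \;-\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} |y_t - f(x_t)|
  \;=\; T - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \bigl(1 - y_t f(x_t)\bigr)
  \;=\; \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} y_t f(x_t) .
```

The same identity with $a = \hat{y}_t$ shows why the player's expected loss is exactly 1 against a uniformly random label: $\mathbb{E}_{y}[1 - y\hat{y}_t] = 1$.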

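Proof of Theorem 17. The equivalence of 1 and 2 follows directly from Proposition 16. First, suppose that $\mathrm{fat}_\alpha$ is infinite for some $\alpha > 0$. Then, the lower bound says that $\mathcal{V}^S_T(\mathcal{F}) \ge \alpha T / 2\sqrt{2}$ and hence $\limsup_{T \to \infty} \mathcal{V}^S_T(\mathcal{F})/T \ge \alpha / 2\sqrt{2}$. Thus, the class $\mathcal{F}$ is not online learnable in the supervised setting. Now, assume that $\mathrm{fat}_\alpha$ is finite for all $\alpha$. Fix an $\epsilon > 0$ and choose $\alpha = \epsilon/16$. Using the upper bound, we have

$$\begin{aligned}
\mathcal{V}^S_T(\mathcal{F}) &\le 8T\alpha + 24\sqrt{T} \int_{\alpha}^{1} \sqrt{\mathrm{fat}_\beta \log\left( \frac{2eT}{\beta} \right)}\, d\beta \\
&\le 8T\alpha + 24\sqrt{T}(1 - \alpha) \sqrt{\mathrm{fat}_\alpha \log\left( \frac{2eT}{\alpha} \right)} \\
&\le \epsilon T/2 + \epsilon T/2
\end{aligned}$$

for $T$ large enough. Thus, $\limsup_{T \to \infty} \mathcal{V}^S_T(\mathcal{F})/T \le \epsilon$. Since $\epsilon > 0$ was arbitrary, this proves that $\mathcal{F}$ is online learnable in the supervised setting.

The statement that $\mathcal{V}^S_T(\mathcal{F})$, $\mathfrak{R}_T(\mathcal{F})$, and $\mathfrak{D}_T(\mathcal{F})$ are within a multiplicative factor of $O(\log^{3/2} T)$ of each other whenever the problem is online learnable follows immediately from Lemma 11 and Proposition 16.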

□

Proof of Lemma 19. First, we claim that for any $x \in \mathcal{X}$, $\mathrm{fat}_\alpha(V_t(r,x)) = \mathrm{fat}_\alpha(V_t)$ for at most two $r, r' \in B_\alpha$. Further, if there are two such $r, r' \in B_\alpha$ then $r, r'$ are consecutive elements of $B_\alpha$ (i.e. $|r - r'| = \alpha$). Suppose, for the sake of contradiction, that $\mathrm{fat}_\alpha(V_t(r,x)) = \mathrm{fat}_\alpha(V_t(r',x)) = \mathrm{fat}_\alpha(V_t)$ for distinct $r, r' \in B_\alpha$ that are not consecutive (i.e. $|r - r'| \ge 2\alpha$). Then let $s = (r + r')/2$ and without loss of generality suppose $r > r'$. By definition, for any $f \in V_t(r,x)$,

$$f(x) \ge r - \alpha/2 = (r' + r)/2 + (r - r')/2 - \alpha/2 \ge s + \alpha/2 .$$

Also, for any $g \in V_t(r',x)$ we have

$$g(x) \le r' + \alpha/2 = (r' + r)/2 + (r' - r)/2 + \alpha/2 \le s - \alpha/2 .$$

Let $\mathbf{v}$ and $\mathbf{v}'$ be trees of depth $\mathrm{fat}_\alpha(V_t)$ $\alpha$-shattered by $V_t(r,x)$ and $V_t(r',x)$, respectively. To get a contradiction, form a new tree of depth $\mathrm{fat}_\alpha(V_t) + 1$ by joining $\mathbf{v}$ and $\mathbf{v}'$ with the constant function $\mathbf{x}_1 = x$ as the root. It is straightforward that this tree is shattered by $V_t(r,x) \cup V_t(r',x)$, a contradiction.

Notice that the times $t \in [T]$ for which $|f_t(x_t) - y_t| > \alpha$ are exactly those times when we update the current set $V_{t+1}$. We shall show that whenever an update is made, $\mathrm{fat}_\alpha(V_{t+1}) < \mathrm{fat}_\alpha(V_t)$, and hence claim that the total number of times $|f_t(x_t) - y_t| > \alpha$ is bounded by $\mathrm{fat}_\alpha(\mathcal{F})$.

At any round we have three possibilities. The first is when $\mathrm{fat}_\alpha(V_t(r,x_t)) < \mathrm{fat}_\alpha(V_t)$ for all $r \in B_\alpha$. In this case, clearly, an update results in $\mathrm{fat}_\alpha(V_{t+1}) = \mathrm{fat}_\alpha(V_t(\lfloor y_t \rfloor_\alpha, x_t)) < \mathrm{fat}_\alpha(V_t)$.

The second case is when $\mathrm{fat}_\alpha(V_t(r,x_t)) = \mathrm{fat}_\alpha(V_t)$ for exactly one $r \in B_\alpha$. In this case the algorithm chooses $f_t(x_t) = r$. If the update is made, $|f_t(x_t) - y_t| > \alpha$ and thus $\lfloor y_t \rfloor_\alpha \ne f_t(x_t)$. We can conclude that

$$\mathrm{fat}_\alpha(V_{t+1}) = \mathrm{fat}_\alpha(V_t(\lfloor y_t \rfloor_\alpha, x_t)) < \mathrm{fat}_\alpha(V_t(f_t(x_t), x_t)) = \mathrm{fat}_\alpha(V_t) .$$

The final case is when $\mathrm{fat}_\alpha(V_t(r,x_t)) = \mathrm{fat}_\alpha(V_t(r',x_t)) = \mathrm{fat}_\alpha(V_t)$ and $|r - r'| = \alpha$. In this case, the algorithm chooses $f_t(x_t) = \frac{r + r'}{2}$. Whenever $y_t$ falls in either of these two consecutive intervals given by $r$ or $r'$, we have $|f_t(x_t) - y_t| \le \alpha$, and hence no update is made. Thus, if an update is made, $\lfloor y_t \rfloor_\alpha \ne r$ and $\lfloor y_t \rfloor_\alpha \ne r'$. However, for any element of $B_\alpha$ other than $r, r'$, the fat-shattering dimension is less than that of $V_t$. That is,

$$\mathrm{fat}_\alpha(V_{t+1}) = \mathrm{fat}_\alpha(V_t(\lfloor y_t \rfloor_\alpha, x_t)) < \mathrm{fat}_\alpha(V_t(r, x_t)) = \mathrm{fat}_\alpha(V_t(r', x_t)) = \mathrm{fat}_\alpha(V_t) .$$

We conclude that whenever we update, $\mathrm{fat}_\alpha(V_{t+1}) < \mathrm{fat}_\alpha(V_t)$, and so we can conclude that the algorithm's prediction is more than $\alpha$ away from $y_t$ on at most $\mathrm{fat}_\alpha(\mathcal{F})$ rounds. □
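The three cases analysed above translate into a simple prediction rule once the dimensions $\mathrm{fat}_\alpha(V_t(r, x_t))$ are known. The sketch below is a hypothetical illustration: it assumes an oracle has already supplied these dimensions as a dictionary (computing them is not addressed here), the function name `predict` is invented for the example, and the choice made in the first case is arbitrary.

```python
from fractions import Fraction

def predict(sub_dims, full_dim):
    """Choose the prediction f_t(x_t) from the subclass dimensions.

    sub_dims maps each grid point r in B_alpha to fat_alpha(V_t(r, x_t)),
    assumed to be supplied by a hypothetical oracle; full_dim is fat_alpha(V_t).
    """
    maximal = sorted(r for r, dim in sub_dims.items() if dim == full_dim)
    if len(maximal) == 0:
        # Case 1: every subclass has smaller dimension, so any update shrinks it.
        # Returning the smallest grid point is an arbitrary choice (assumption).
        return min(sub_dims)
    if len(maximal) == 1:
        # Case 2: predict the unique grid point whose subclass keeps the dimension.
        return maximal[0]
    if len(maximal) == 2:
        # Case 3: the two grid points are consecutive; predict their midpoint.
        return (maximal[0] + maximal[1]) / 2
    raise ValueError("at most two subclasses can retain the full dimension")

grid = [Fraction(-3, 4), Fraction(-1, 4), Fraction(1, 4), Fraction(3, 4)]
case1 = predict(dict(zip(grid, [1, 0, 1, 1])), full_dim=2)
case2 = predict(dict(zip(grid, [1, 2, 0, 1])), full_dim=2)
case3 = predict(dict(zip(grid, [0, 2, 2, 1])), full_dim=2)
```

In the third example the prediction is the midpoint of the two grid points that retain the full dimension, so any label rounding to either of them is within $\alpha$ of the prediction.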



Proof of Corollary For the choice of weights pi — we see from Proposition 34 that for any i. 



E [Rt] <a,T+^ Tfat^. log (^^^ + Vt (3 + 2 log(i)) 

Now for any a > let ia be such that a < 2~*° and for any i < i^, a > 2~'°. Using the above bound on 
expected regret we have that 



E[Rt] < ai„r+ Wrfat„,^ log ( — ) +%/r(3 + 21og(i„)) 



However for our choice of ia we see that ia < log(l/a) and further a^^ < a. Hence we conclude that 



E[Rt] < ar+ yrfatalog (^^j + Vr(^3 + 21oglog 
Since choice of a was arbitrary we take infimum and get the result. □ 



^The argument should be compared to the combinatorial argument in Theorem [H] 
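Proof of Lemma 23. Without loss of generality assume $L = 1$. The general case follows from this by simply scaling $\phi$ appropriately. We will also use the shorthand $\phi(f, z)$ to denote $\phi(f(z), z)$. We have

$$\mathfrak{R}_T(\phi(\mathcal{F})) = \sup_{\mathbf{x}} \mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f, \mathbf{x}_t(\epsilon)) \right] .$$

The proof proceeds by sequentially using the Lipschitz property of $\phi(f(\mathbf{x}_t(\epsilon)))$ for increasing $t$. Towards this end, define

$$R_t = \sup_{\mathbf{x}} \mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{s=1}^{t} \epsilon_s f(\mathbf{x}_s(\epsilon)) + \sum_{s=t+1}^{T} \epsilon_s \phi(f, \mathbf{x}_s(\epsilon)) \right] .$$

Note that $R_0 = \mathfrak{R}_T(\phi(\mathcal{F}))$ and $R_T = \mathfrak{R}_T(\mathcal{F})$. We need to show $R_0 \le R_T$ and we will show this by proving $R_{t-1} \le R_t$ for all $t \in [T]$. So, let us fix $t \in [T]$ and start with $R_{t-1}$: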



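$$R_{t-1} = \sup_{\mathbf{x}} \mathbb{E}_{\epsilon}\left[ \sup_{f \in \mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s f(\mathbf{x}_s(\epsilon)) + \sum_{s=t}^{T} \epsilon_s\, \phi(f, \mathbf{x}_s(\epsilon)) \right].$$

We can write the above supremum as

$$\begin{aligned}
R_{t-1} &= \sup_{x_1 \in \mathcal{Z}} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_t \in \mathcal{Z}} \mathbb{E}_{\epsilon_t} \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{f \in \mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s f(x_s) + \epsilon_t\, \phi(f, x_t) + \sum_{s=t+1}^{T} \epsilon_s\, \phi(f, \mathbf{x}_{s-t}(\epsilon_{t+1:T})) \right] \\
&= \sup_{x_1 \in \mathcal{Z}} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_t \in \mathcal{Z}} S^{\phi}(x_{1:t}, \epsilon_{1:t-1}),
\end{aligned}$$

where we have simply defined

$$S^{\phi}(x_{1:t}, \epsilon_{1:t-1}) = \mathbb{E}_{\epsilon_t} \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{f \in \mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s f(x_s) + \epsilon_t\, \phi(f, x_t) + \sum_{s=t+1}^{T} \epsilon_s\, \phi(f, \mathbf{x}_{s-t}(\epsilon_{t+1:T})) \right].$$

Here, $\mathbf{x}$ ranges over all $\mathcal{Z}$-valued trees of depth $T - t$.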
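Similarly, $R_t$ can be written as

$$\begin{aligned}
R_t &= \sup_{x_1 \in \mathcal{Z}} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_t \in \mathcal{Z}} \mathbb{E}_{\epsilon_t} \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{f \in \mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s f(x_s) + \epsilon_t f(x_t) + \sum_{s=t+1}^{T} \epsilon_s\, \phi(f, \mathbf{x}_{s-t}(\epsilon_{t+1:T})) \right] \\
&= \sup_{x_1 \in \mathcal{Z}} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_t \in \mathcal{Z}} S(x_{1:t}, \epsilon_{1:t-1}),
\end{aligned}$$

where we have defined

$$S(x_{1:t}, \epsilon_{1:t-1}) = \mathbb{E}_{\epsilon_t} \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{f \in \mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s f(x_s) + \epsilon_t f(x_t) + \sum_{s=t+1}^{T} \epsilon_s\, \phi(f, \mathbf{x}_{s-t}(\epsilon_{t+1:T})) \right].$$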



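Thus, to prove $R_{t-1} \le R_t$ it suffices to prove $S^{\phi}(x_{1:t}, \epsilon_{1:t-1}) \le S(x_{1:t}, \epsilon_{1:t-1})$ for all $x_{1:t} \in \mathcal{Z}^t$ and $\epsilon_{1:t-1} \in \{\pm 1\}^{t-1}$. Fix $x_{1:t}$, $\epsilon_{1:t-1}$. By explicitly taking expectation w.r.t. $\epsilon_t$ in the definition of $S^{\phi}$, we have

$$\begin{aligned}
2 S^{\phi} &= \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{f \in \mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s f(x_s) + \phi(f, x_t) + \sum_{s=t+1}^{T} \epsilon_s\, \phi(f, \mathbf{x}_{s-t}(\epsilon_{t+1:T})) \right] \\
&\quad + \sup_{\mathbf{x}'} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{g \in \mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s g(x_s) - \phi(g, x_t) + \sum_{s=t+1}^{T} \epsilon_s\, \phi(g, \mathbf{x}'_{s-t}(\epsilon_{t+1:T})) \right] \\
&= \sup_{\mathbf{x}, \mathbf{x}'} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{f, g \in \mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s \big(f(x_s) + g(x_s)\big) + \phi(f, x_t) - \phi(g, x_t) + \sum_{s=t+1}^{T} \epsilon_s \big( \phi(f, \mathbf{x}_{s-t}(\epsilon_{t+1:T})) + \phi(g, \mathbf{x}'_{s-t}(\epsilon_{t+1:T})) \big) \right].
\end{aligned}$$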
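Now $\phi(f, x_t) - \phi(g, x_t) = \phi(f(x_t), x_t) - \phi(g(x_t), x_t)$ is upper bounded by $|f(x_t) - g(x_t)|$ because $\phi(\cdot, z)$ is 1-Lipschitz for any $z$. Hence,

$$2 S^{\phi} \le \sup_{\mathbf{x}, \mathbf{x}'} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{f, g \in \mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s \big(f(x_s) + g(x_s)\big) + |f(x_t) - g(x_t)| + \sum_{s=t+1}^{T} \epsilon_s \big( \phi(f, \mathbf{x}_{s-t}(\epsilon_{t+1:T})) + \phi(g, \mathbf{x}'_{s-t}(\epsilon_{t+1:T})) \big) \right].$$

Since, for any $\epsilon_{t+1:T}$, the first and last sums above are unchanged if we simultaneously exchange $f$ with $g$ and $\mathbf{x}$ with $\mathbf{x}'$, the above supremum is actually equal to one where the absolute value in the middle term is absent. That is,

$$\begin{aligned}
2 S^{\phi} &\le \sup_{\mathbf{x}, \mathbf{x}'} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{f, g \in \mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s \big(f(x_s) + g(x_s)\big) + f(x_t) - g(x_t) + \sum_{s=t+1}^{T} \epsilon_s \big( \phi(f, \mathbf{x}_{s-t}(\epsilon_{t+1:T})) + \phi(g, \mathbf{x}'_{s-t}(\epsilon_{t+1:T})) \big) \right] \\
&= \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{f \in \mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s f(x_s) + f(x_t) + \sum_{s=t+1}^{T} \epsilon_s\, \phi(f, \mathbf{x}_{s-t}(\epsilon_{t+1:T})) \right] \\
&\quad + \sup_{\mathbf{x}'} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{g \in \mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s g(x_s) - g(x_t) + \sum_{s=t+1}^{T} \epsilon_s\, \phi(g, \mathbf{x}'_{s-t}(\epsilon_{t+1:T})) \right] \\
&= 2\, \mathbb{E}_{\epsilon_t} \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{f \in \mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s f(x_s) + \epsilon_t f(x_t) + \sum_{s=t+1}^{T} \epsilon_s\, \phi(f, \mathbf{x}_{s-t}(\epsilon_{t+1:T})) \right] \\
&= 2\, S(x_{1:t}, \epsilon_{1:t-1}).
\end{aligned}$$

□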
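The contraction inequality can be checked numerically on a toy instance. The sketch below is only an illustration under stated assumptions: the three-function class `F`, the three-point domain, the offsets `shift`, and the helper name `seq_rademacher` are all hypothetical and not taken from the paper. It evaluates the sequential Rademacher complexity by brute force, using the same iterated form $\sup_{x_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_T} \mathbb{E}_{\epsilon_T}$ as in the proof, and compares the class with its image under $\phi(a, z) = |a - \mathrm{shift}_z|$, which is 1-Lipschitz in $a$.

```python
from functools import lru_cache


def seq_rademacher(values, T):
    """Brute-force sup over depth-T trees of E_eps[ sup_f sum_t eps_t * f(x_t(eps)) ].

    values[f][z] is the value of the f-th function at the z-th point of a
    finite domain (a hypothetical toy setup, not taken from the paper).
    """
    n_points = len(values[0])

    @lru_cache(maxsize=None)
    def value(t, sums):
        # sums[f] holds the signed partial sum of f along the current path.
        if t == T:
            return max(sums)
        best = float("-inf")
        for z in range(n_points):
            up = tuple(s + row[z] for s, row in zip(sums, values))
            down = tuple(s - row[z] for s, row in zip(sums, values))
            # Average over the two signs, then maximize over the next point.
            best = max(best, 0.5 * (value(t + 1, up) + value(t + 1, down)))
        return best

    return value(0, (0.0,) * len(values))


# Toy class of three functions on a three-point domain, values in [-1, 1].
F = [
    (0.9, -0.4, 0.1),
    (-0.7, 0.8, -0.2),
    (0.3, 0.3, -1.0),
]
shift = (0.2, -0.5, 0.0)

# phi(a, z) = |a - shift[z]| is 1-Lipschitz in a for every z.
phi_F = [tuple(abs(a - shift[z]) for z, a in enumerate(row)) for row in F]

r_f = seq_rademacher(F, T=4)
r_phi = seq_rademacher(phi_F, T=4)
print("complexity of F:     ", r_f)
print("complexity of phi(F):", r_phi)
```

The recursion is exponential in the depth, so it is only meant for very small `T` and small domains.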

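Proof of Lemma 24. Without loss of generality assume that the Lipschitz constant $L = 1$, because the general case follows by scaling $\phi$. Now note that by Theorem 9 we have that

$$\mathfrak{R}_T(\phi \circ \mathcal{F}) \le \inf_{\alpha} \left\{ 4T\alpha + 12 \int_{\alpha}^{1} \sqrt{T \log \mathcal{N}_2(\delta, \phi \circ \mathcal{F}, T)} \, d\delta \right\} \qquad (16)$$

Now we claim that we can bound

$$\log \mathcal{N}_2(\delta, \phi \circ \mathcal{F}, T) \le \sum_{j=1}^{k} \log \mathcal{N}_\infty(\delta, \mathcal{F}_j, T).$$

To see this we first start by noting that

$$\sqrt{\frac{1}{T} \sum_{t=1}^{T} \big( \phi(f(\mathbf{x}_t(\epsilon))) - \phi(\mathbf{v}_t(\epsilon)) \big)^2} \le \sqrt{\frac{1}{T} \sum_{t=1}^{T} \max_{j} \big( f_j(\mathbf{x}_t(\epsilon)) - \mathbf{v}^j_t(\epsilon) \big)^2} \le \sqrt{\max_{j} \max_{t} \big( f_j(\mathbf{x}_t(\epsilon)) - \mathbf{v}^j_t(\epsilon) \big)^2} \le \max_{j} \max_{t} \left| f_j(\mathbf{x}_t(\epsilon)) - \mathbf{v}^j_t(\epsilon) \right|.$$

This means that if we have $V_1, \ldots, V_k$ that are minimal $\ell_\infty$ covers for $\mathcal{F}_1, \ldots, \mathcal{F}_k$ at level $\delta$, and we construct a cover $V = V_1 \times \ldots \times V_k$ for $\mathcal{F}$, then for any $f = (f_1, \ldots, f_k) \in \mathcal{F}$ and any $\epsilon \in \{\pm 1\}^T$, there exists $\mathbf{v} = (\mathbf{v}^1, \ldots, \mathbf{v}^k) \in V$ such that

$$\sqrt{\frac{1}{T} \sum_{t=1}^{T} \big( \phi(f(\mathbf{x}_t(\epsilon))) - \phi(\mathbf{v}_t(\epsilon)) \big)^2} \le \max_{j} \max_{t} \left| f_j(\mathbf{x}_t(\epsilon)) - \mathbf{v}^j_t(\epsilon) \right| \le \delta.$$

Hence we see that $V$ is an $\ell_\infty$ cover at scale $\delta$ for $\phi \circ \mathcal{F}$. Hence

$$\log \mathcal{N}_2(\delta, \phi \circ \mathcal{F}, T) \le \log \mathcal{N}_\infty(\delta, \phi \circ \mathcal{F}, T) \le \log(|V|) = \log\Big( \prod_{j=1}^{k} |V_j| \Big) = \sum_{j=1}^{k} \log \mathcal{N}_\infty(\delta, \mathcal{F}_j, T)$$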



as claimed. Now using this in Equation 16 we have that 



T^log A/-oo(<5,J-, ,r) d6 

\ 3 = 1 



D\{(j)oJ-) <inf |4Ta + 12 J 

{k 1 
ATa + 12j2 [ log Noo{S,:Fj,T) 

- Yl '^^ 1^^" ^ / \/^ log AAoo((5,JS-,T) djj 
Now applying Theorem 17 we conclude as required that 

$$\mathfrak{R}_T(\phi \circ \mathcal{F}) \le O\left( \log^{3/2}(T) \right) \sum_{j=1}^{k} \mathfrak{R}_T(\mathcal{F}_j)$$



□ 



Proof of Corollary 25. We first extend the binary function $b$ to a function $\bar{b}$ defined for any $x \in \mathbb{R}^k$ as follows:

$$\bar{b}(x) = \begin{cases} (1 - \|x - a\|_\infty)\, b(a) & \text{if } \|x - a\|_\infty < 1 \text{ for some } a \in \{\pm 1\}^k \\ 0 & \text{otherwise} \end{cases}$$

First note that $\bar{b}$ is well-defined since all points in the $k$-cube are separated by $\ell_\infty$ distance 2. Further note that $\bar{b}$ is 1-Lipschitz w.r.t. the $\ell_\infty$ norm, and so applying the preceding lemma we conclude the statement of the corollary.

□
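The extension used in this proof is easy to probe numerically. The following sketch is illustrative only: the dimension `k = 3`, the randomly drawn binary function `b`, and the helper names `extend` and `sup_dist` are assumptions made for the example. It evaluates the extension on random pairs of points and records the largest violation of the 1-Lipschitz property in the sup norm, which should not be positive.

```python
import itertools
import random

k = 3
rng = random.Random(0)
vertices = list(itertools.product((-1, 1), repeat=k))
# A hypothetical binary function on the vertices of the cube {-1, +1}^k.
b = {a: rng.choice((-1, 1)) for a in vertices}


def extend(x):
    """Extension of b to R^k: (1 - ||x - a||_inf) * b(a) near a vertex a, else 0."""
    # Only the vertex sharing the sign pattern of x can be at sup-distance < 1.
    a = tuple(1 if xi > 0 else -1 for xi in x)
    dist = max(abs(xi - ai) for xi, ai in zip(x, a))
    return (1.0 - dist) * b[a] if dist < 1 else 0.0


def sup_dist(x, y):
    return max(abs(xi - yi) for xi, yi in zip(x, y))


pairs = []
for _ in range(2000):
    x = tuple(rng.uniform(-2.5, 2.5) for _ in range(k))
    y = tuple(xi + rng.uniform(-0.5, 0.5) for xi in x)
    pairs.append((x, y))

# Largest value of |b_bar(x) - b_bar(y)| - ||x - y||_inf over the sample.
worst = max(abs(extend(x) - extend(y)) - sup_dist(x, y) for x, y in pairs)
print("largest Lipschitz violation:", worst)
```

A random sample cannot prove the Lipschitz property, but it is a quick sanity check of the case analysis behind it.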



Proof of Proposition 26. The most difficult of these is Part 4, which follows immediately by Lemma 23 by taking $\phi(\cdot, z)$ there to be simply $\phi(\cdot)$. The other items follow similarly to Theorem 15 in [27], and we provide the proofs for completeness. Note that, unlike the Rademacher complexity defined in [37], sequential Rademacher complexity does not have the absolute value around the sum.

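Part 1 is immediate because for any fixed tree $\mathbf{x}$ and fixed realization of $\{\epsilon_t\}$,

$$\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \le \sup_{f \in \mathcal{G}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)).$$

Now taking expectation over $\epsilon$ and supremum over $\mathbf{x}$ completes the argument.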
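To show Part 2, first observe that, according to Part 1,

$$\mathfrak{R}_T(\mathcal{F}) \le \mathfrak{R}_T(\mathrm{conv}(\mathcal{F})).$$

Now, any $h \in \mathrm{conv}(\mathcal{F})$ can be written as $h = \sum_{j=1}^{m} \alpha_j f_j$ with $\sum_{j=1}^{m} \alpha_j = 1$, $\alpha_j \ge 0$. Then, for a fixed tree $\mathbf{x}$ and sequence $\epsilon$,

$$\sum_{t=1}^{T} \epsilon_t h(\mathbf{x}_t(\epsilon)) = \sum_{j=1}^{m} \alpha_j \sum_{t=1}^{T} \epsilon_t f_j(\mathbf{x}_t(\epsilon)) \le \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon))$$

and thus

$$\sup_{h \in \mathrm{conv}(\mathcal{F})} \sum_{t=1}^{T} \epsilon_t h(\mathbf{x}_t(\epsilon)) \le \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)).$$

Taking expectation over $\epsilon$ and supremum over $\mathbf{x}$ completes the proof.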

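To prove Part 3, first observe that the statement is easily seen to hold for $c > 0$. That is, $\mathfrak{R}_T(c\mathcal{F}) = c\, \mathfrak{R}_T(\mathcal{F})$ follows directly from the definition. Hence, it remains to convince ourselves of the statement for $c = -1$. That is, $\mathfrak{R}_T(-\mathcal{F}) = \mathfrak{R}_T(\mathcal{F})$. To prove this, consider a tree $\mathbf{x}^R$ that is a reflection of $\mathbf{x}$. That is, $\mathbf{x}^R_t(\epsilon) = \mathbf{x}_t(-\epsilon)$ for all $t \in [T]$. It is then enough to observe that

$$\begin{aligned}
\mathbb{E}_{\epsilon}\left[ \sup_{f \in -\mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right] &= \mathbb{E}_{\epsilon}\left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} -\epsilon_t f(\mathbf{x}_t(\epsilon)) \right] \\
&= \mathbb{E}_{\epsilon}\left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(-\epsilon)) \right] = \mathbb{E}_{\epsilon}\left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}^R_t(\epsilon)) \right]
\end{aligned}$$

where we used the fact that $\epsilon$ and $-\epsilon$ have the same distribution. As $\mathbf{x}$ varies over all trees, $\mathbf{x}^R$ also varies over all trees. Hence taking the supremum over $\mathbf{x}$ above finishes the argument.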

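Finally, for Part 5,

$$\sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T} \epsilon_t (f + h)(\mathbf{x}_t(\epsilon)) \right\} = \left\{ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right\} + \left\{ \sum_{t=1}^{T} \epsilon_t h(\mathbf{x}_t(\epsilon)) \right\}.$$

Note that, since $h(\mathbf{x}_t(\epsilon))$ only depends on $\epsilon_{1:t-1}$, we have

$$\mathbb{E}_{\epsilon}\left[ \epsilon_t h(\mathbf{x}_t(\epsilon)) \right] = \mathbb{E}_{\epsilon_{1:t-1}}\left[ \mathbb{E}\left[ \epsilon_t \mid \epsilon_{1:t-1} \right] h(\mathbf{x}_t(\epsilon)) \right] = 0.$$

Thus,

$$\mathfrak{R}_T(\mathcal{F} + h) = \mathfrak{R}_T(\mathcal{F}).$$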



□ 
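The key step in Part 5 is that the node $\mathbf{x}_t(\epsilon)$ depends only on the earlier signs $\epsilon_{1:t-1}$, so each term $\epsilon_t h(\mathbf{x}_t(\epsilon))$ averages to zero. The sketch below makes this concrete by enumerating all sign sequences on an explicit tree. The depth `T = 4`, the random node labels, and the function `h` are arbitrary choices made for illustration only.

```python
import itertools
import random

T = 4
rng = random.Random(0)

# A depth-T real-valued tree: the node visited at round t is indexed by the
# sign prefix (eps_1, ..., eps_{t-1}); the root has the empty prefix.
tree = {
    prefix: rng.uniform(-1.0, 1.0)
    for t in range(T)
    for prefix in itertools.product((-1, 1), repeat=t)
}


def h(z):
    # An arbitrary fixed bounded function on the domain (hypothetical).
    return z ** 3 - 0.5


def h_term(eps):
    """sum_t eps_t * h(x_t(eps)) along the path selected by the signs eps."""
    return sum(eps[t] * h(tree[eps[:t]]) for t in range(T))


all_signs = list(itertools.product((-1, 1), repeat=T))
average = sum(h_term(eps) for eps in all_signs) / len(all_signs)
print("E_eps[ sum_t eps_t h(x_t(eps)) ] =", average)
```

Storing the tree as a dictionary keyed by sign prefixes mirrors the definition of a tree as a sequence of maps from $\{\pm 1\}^{t-1}$ to the domain.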



Proof of Proposition 21 We use linearity of the functions in J^vv to write 

T 



fRT(J^w) = sup Ee 



= sup Eg 



sup (u;,xt(e)) 



sup ( w, V'e(Xf(e) 
\ ■S 



Let $\Psi^*$ be the Fenchel conjugate of $\Psi$. By the Fenchel-Young inequality, for any $\lambda > 0$,



< 



A 



A 



Taking supremum over $w \in \mathcal{W}$, we get,



sup ( w, 2^ etxt(e) ) < ^— + ^ 



wew 
Now, taking expectation w.r.t. $\epsilon$, we get



sup ( w,y^efXt(e) 



< 



^ Tl 



A 



A 



Now we use Lemma 28 below with $Z_t = \lambda \epsilon_t \mathbf{x}_t(\epsilon)$. Note that $\mathbb{E}[Z_t \mid \epsilon_{1:t-1}] = 0$ and $\|Z_t\|_* \le \lambda \|\mathcal{X}\|_*$ almost surely. Thus, we get,



E, 



sup ( w,y^£fXf(e) 



wew 



< 



A 



A^ IIA-Ii^r 

2^7a 



Simplifying and optimizing over A > 0, gives 



□ 



Proof of Lemma 29. Consider the game $(\mathcal{F}, \mathcal{X}_{\mathrm{cvx}})$ and fix a randomized strategy $\pi$ of the player. Then, the expected regret of $\pi$ against any adversary playing $g_1, \ldots, g_T$ can be bounded as



t=l 

T 



t=l 

T 



<J29t (E„,^^,(gi,,_i) [ut]) - jn^Xlf*(") 



= R(7r',5i:T) . 

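Here we used Jensen's inequality in the second line and $\pi'$ is simply the deterministic strategy obtained from $\pi$ that, on round $t$, plays

$$\mathbb{E}_{u_t \sim \pi_t(g_{1:t-1})}\left[ u_t \right].$$

This means that $\mathcal{V}_T(\mathcal{F}, \mathcal{X}_{\mathrm{cvx}}) = \mathcal{V}_T^{\mathrm{det}}(\mathcal{F}, \mathcal{X}_{\mathrm{cvx}})$ where $\mathcal{V}_T^{\mathrm{det}}$ is defined as the minimax regret obtainable only using deterministic player strategies. Now, we appeal to Theorem 14 in [2] that says $\mathcal{V}_T^{\mathrm{det}}(\mathcal{F}, \mathcal{X}_{\mathrm{cvx}}) = \mathcal{V}_T^{\mathrm{det}}(\mathcal{F}, \mathcal{X}_{\mathrm{lin}})$. Since $\mathcal{X}_{\mathrm{lin}}$ also consists of convex (in fact, linear) functions, the above argument again gives $\mathcal{V}_T^{\mathrm{det}}(\mathcal{F}, \mathcal{X}_{\mathrm{lin}}) = \mathcal{V}_T(\mathcal{F}, \mathcal{X}_{\mathrm{lin}})$. This finishes the proof of the lemma. □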
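The derandomization step rests on Jensen's inequality: against a convex loss, playing the mean of a randomized strategy is never worse than the randomized strategy in expectation. A minimal numerical sketch follows. The convex function `g` and the uniform distribution of the randomized play are hypothetical choices for illustration, and the expectation is taken over the empirical sample, for which Jensen's inequality holds exactly.

```python
import random


def g(u):
    # A convex loss chosen by the adversary (hypothetical example).
    return (u - 0.3) ** 2 + abs(u)


rng = random.Random(0)
# Samples standing in for the plays u_t of a randomized strategy on one round.
draws = [rng.uniform(-1.0, 1.0) for _ in range(1000)]

mean_play = sum(draws) / len(draws)
randomized_loss = sum(g(u) for u in draws) / len(draws)  # E[g(u)]
deterministic_loss = g(mean_play)                        # g(E[u])

print("loss of randomized play:", randomized_loss)
print("loss of its mean:       ", deterministic_loss)
```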

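Proof of Proposition 30. Fix a $\gamma > 0$ and use the loss

$$\ell(\hat{y}, y) = \begin{cases} 1 & \hat{y} y \le 0 \\ 1 - \hat{y} y / \gamma & 0 < \hat{y} y < \gamma \\ 0 & \hat{y} y \ge \gamma \end{cases}$$

First note that since the loss is $1/\gamma$-Lipschitz, we can use Theorem 2 and the Rademacher contraction Lemma 23 to show that for each $\gamma > 0$ there exists a randomized strategy $\pi^\gamma$ such that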



E 



< 



Mi2Kf{^t),yt)+^^T{J') 

f^^T^ 7 



Now note that the loss is lower bounded by the zero-one loss $\mathbf{1}\{\hat{y} y < 0\}$ and is upper bounded by the margin zero-one loss $\mathbf{1}\{\hat{y} y < \gamma\}$. Hence we see that for this strategy,



E 



.t=i 



< M T^{f{xt)yt<j} + -9lT{r) 



(17) 
Hence, for each fixed $\gamma$, the randomized strategy $\pi^\gamma$ satisfies the above bound. Now we discretize over $\gamma$'s as $\gamma_i = 1/2^i$ and use the outputs of the randomized strategies $\pi^{\gamma_1}, \pi^{\gamma_2}, \ldots$ that attain the regret bounds given in (17) as experts, running the experts algorithm given in Algorithm 3 with initial weight $p_i$ for expert $i$. Using Proposition 34, we get a randomized strategy $\pi$ such that for any $i$



E 



IZE/t~^*(.i:*-i) ['^{Mxt)yt < 0}] 



< inf ^ 1 {fixt)yt < l^} + -5nT(-F) + ( 1 + 2 log 



t=i 



li 



ITT 



Now for any 7 > let be such that 7 < 2 and for any i < i-^ , 7 > 2 *^ . Then using the above bound 
we see that 

T 



E 



< inf y 1 {f{xt)yt < 27} + -fRT(-F) + Vt 1 + 21og 



ITT 

7i 



However note that < log(l/7) and so we can conclude that 



E 



< mf^ J2 1 if(^i)yt < 27}+-^t(^)+Vt ( 1 + 2 log 



t=i 



TT log (1/7) 

n/6 



□ 
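The two properties of the loss used in this proof, the sandwich between the zero-one loss and the margin zero-one loss and the $1/\gamma$-Lipschitz property, can be checked on a grid of margins. The sketch below assumes the middle piece of the loss is the linear interpolation $1 - \hat{y}y/\gamma$, which is the only $1/\gamma$-Lipschitz way to connect the values 1 and 0 over an interval of length $\gamma$. The function name `ramp_loss` and the value `gamma = 0.25` are illustrative.

```python
def ramp_loss(margin, gamma):
    """Loss as a function of the margin y_hat * y (middle piece assumed linear)."""
    if margin <= 0:
        return 1.0
    if margin >= gamma:
        return 0.0
    return 1.0 - margin / gamma


gamma = 0.25
grid = [k / 1000 - 1.0 for k in range(2001)]  # margins in [-1, 1]

# Zero-one loss 1{m < 0} <= ramp loss <= margin zero-one loss 1{m < gamma}.
sandwiched = all((m < 0) <= ramp_loss(m, gamma) <= (m < gamma) for m in grid)

# Differences between neighbouring grid points respect the slope bound 1/gamma.
lipschitz = all(
    abs(ramp_loss(a, gamma) - ramp_loss(b, gamma)) <= abs(a - b) / gamma + 1e-12
    for a, b in zip(grid, grid[1:])
)
print(sandwiched, lipschitz)
```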



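Proof of Proposition 31. We shall prove that for any $i \in [k]$,

$$\mathfrak{R}_T(\mathcal{F}_i) \le 2 L B_i\, \mathfrak{R}_T(\mathcal{F}_{i-1}).$$

To see this note that

$$\begin{aligned}
\mathfrak{R}_T(\mathcal{F}_i) &= \sup_{\mathbf{x}} \mathbb{E}_{\epsilon}\left[ \sup_{\substack{w^i : \|w^i\|_1 \le B_i \\ \forall j,\ f_j \in \mathcal{F}_{i-1}}} \sum_{t=1}^{T} \epsilon_t \sum_{j} w^i_j\, \sigma(f_j(\mathbf{x}_t(\epsilon))) \right] \\
&\le \sup_{\mathbf{x}} \mathbb{E}_{\epsilon}\left[ \sup_{\substack{w^i : \|w^i\|_1 \le B_i \\ \forall j,\ f_j \in \mathcal{F}_{i-1}}} \|w^i\|_1 \max_{j} \left| \sum_{t=1}^{T} \epsilon_t\, \sigma(f_j(\mathbf{x}_t(\epsilon))) \right| \right] && \text{(Hölder's inequality)} \\
&\le \sup_{\mathbf{x}} \mathbb{E}_{\epsilon}\left[ B_i \sup_{f \in \mathcal{F}_{i-1}} \left| \sum_{t=1}^{T} \epsilon_t\, \sigma(f(\mathbf{x}_t(\epsilon))) \right| \right] \\
&= \sup_{\mathbf{x}} \mathbb{E}_{\epsilon}\left[ B_i \sup_{f \in \mathcal{F}_{i-1}} \max\left\{ \sum_{t=1}^{T} \epsilon_t\, \sigma(f(\mathbf{x}_t(\epsilon))),\ -\sum_{t=1}^{T} \epsilon_t\, \sigma(f(\mathbf{x}_t(\epsilon))) \right\} \right] \\
&= \sup_{\mathbf{x}} \mathbb{E}_{\epsilon}\left[ B_i \max\left\{ \sup_{f \in \mathcal{F}_{i-1}} \sum_{t=1}^{T} \epsilon_t\, \sigma(f(\mathbf{x}_t(\epsilon))),\ \sup_{f \in \mathcal{F}_{i-1}} \sum_{t=1}^{T} -\epsilon_t\, \sigma(f(\mathbf{x}_t(\epsilon))) \right\} \right] \\
&\le \sup_{\mathbf{x}} \mathbb{E}_{\epsilon}\left[ B_i \sup_{f \in \mathcal{F}_{i-1}} \sum_{t=1}^{T} \epsilon_t\, \sigma(f(\mathbf{x}_t(\epsilon))) \right] + \sup_{\mathbf{x}} \mathbb{E}_{\epsilon}\left[ B_i \sup_{f \in \mathcal{F}_{i-1}} \sum_{t=1}^{T} -\epsilon_t\, \sigma(f(\mathbf{x}_t(\epsilon))) \right] && (\sigma(0) = 0 \text{ and } 0 \in \mathcal{F}_{i-1}) \\
&= 2 B_i \sup_{\mathbf{x}} \mathbb{E}_{\epsilon}\left[ \sup_{f \in \mathcal{F}_{i-1}} \sum_{t=1}^{T} \epsilon_t\, \sigma(f(\mathbf{x}_t(\epsilon))) \right] && \text{(Proposition 26)} \\
&\le 2 B_i L \sup_{\mathbf{x}} \mathbb{E}_{\epsilon}\left[ \sup_{f \in \mathcal{F}_{i-1}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right] && \text{(Lemma 23)} \\
&= 2 B_i L\, \mathfrak{R}_T(\mathcal{F}_{i-1}) && (18)
\end{aligned}$$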
To finish the proof we note that



dXriTi) = supE, 

X 

< supE, 

X 

< Bi supE 



sup ^etu;^xt(e) 

weM.'^:\\w\\i<Bi 



sup ||w||i 

■!iieK'':||lii||l<Bl 



t= 

max<'^eiXt(e)W 



Note that the instances $x \in \mathcal{X}$ are vectors in $\mathbb{R}^d$ and so for a given instance tree $\mathbf{x}$, for any $i \in [d]$, $\mathbf{x}[i]$ given by only taking the $i$-th coordinate is a valid real-valued tree. Hence using Lemma [s] we conclude that



9^t(-7^i) < Bi supE, 

X 

< BiV2TX^logd 
Using the above and Equation [18] we conclude the proof. 



max< ^etxt(e)[i] 



□ 



Proof of Proposition 32. For a tree of depth $d$, the indicator function of a leaf is a conjunction of no more than $d$ decision functions. More specifically, if the decision tree consists of decision nodes chosen from a class $\mathcal{H}$ of binary-valued functions, the indicator function of leaf $l$ (which takes value 1 at a point $x$ if $x$ reaches $l$, and 0 otherwise) is a conjunction of $d_l$ functions from $\mathcal{H}$, where $d_l$ is the depth of leaf $l$. We can represent the function computed by the tree as the sign of

$$g(x) = \sum_{l} w_l \sigma_l \bigwedge_{i=1}^{d_l} h_{l,i}(x)$$

where the sum is over all leaves $l$, $w_l > 0$, $\sum_l w_l = 1$, $\sigma_l \in \{\pm 1\}$ is the label of leaf $l$, $h_{l,i} \in \mathcal{H}$, and the conjunction is understood to map to $\{0, 1\}$. Let $\mathcal{F}$ be this class of functions. Now note that if we fix some $L > 0$ then we see that the loss

$$\phi_L(\alpha) = \begin{cases} 1 & \text{if } \alpha \le 0 \\ 1 - L\alpha & \text{if } 0 < \alpha \le 1/L \\ 0 & \text{otherwise} \end{cases}$$

is $L$-Lipschitz and so by Theorem 2 and Lemma 23 we have that for every $L > 0$, there exists a randomized strategy $\pi^L$ for the player, such that for any sequence $z_1 = (x_1, y_1), \ldots, z_T = (x_T, y_T)$,



E 



Now note that cf)]^ upper bounds the step function and so 

T 



E 
t=l



Now say i* G T is the minimizcr of X]t=i ■'- {^(^t) 7^ Vt} then note that 

T T 

t=i i 

T 

< ^ 1 {r (a;*) ^ + E ^^(0 niax(0, 1 - Lwi) 

T 

< ^ 1 {r (a:*) ^ yj + E niax (o, (1 - Lwi)Ct{1)) 

T 

= inf E 1 {^(^*) ^ + E (O' (1 - Lwi)Ct{1)) 



t=i 



Hence we see that 

T 



E 



1 

^ ^ff E 1 {^(^*) ^ + E "^^^ (1 - Lwi)Ct{1) 



Now if we discretize over $L$ as $L_i = i$ for all $i \in \mathbb{N}$ and run the experts algorithm (Algorithm 3) with $\pi_1, \pi_2, \ldots$ as our experts, where the weight $p_i$ of expert $\pi_i$ is chosen so that $\sum_i p_i = 1$, then we get for this randomized strategy $P$, from Proposition 34, that for all $i \in \mathbb{N}$,



E 



- E 1 {*(^*) ^ + E (1 ~ Lwi)Ct{1)) + LUi{T) + Vf + 2\/riog(L7r/%/6) 



Now we pick L = \{l : Ct{1) > 2$H(r)}| =: Meaf and also pick = if C't(0 < 2m{T) and = l/L 
otherwise. Hence we see that 



E 



.t=l 



inf ^ 1 {t{x,) ^yt} + Yl {Ct{1) < 2m{T)} 

t=i I 

+ 2fH(r) ^ 1 {c't(/) > 2iH(r)} + \/r + 2Vriog(Meaf^/y6) 
I 

T 

inf ^ 1 {^(xt) ^ zjj + ^ minCCTCO, 2^n(r)) + Vt (l + 2 log(iVieaf7r/V6)) 



Now finally we can apply Corollary 25 to bound *K(T) < ^©(log ' T) 5H('H) and thus conclude the proof by 
plugging this into the above. □ 
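The representation of a decision tree as the sign of a weighted sum of leaf indicators, used at the start of this proof, can be illustrated on a tiny example. Everything concrete below is a hypothetical choice: the two threshold decisions `h1` and `h2`, their negations (the sketch assumes the decision class is closed under negation), the leaf weights, and the labels. The point is only that exactly one leaf conjunction is active at any input, so the sign of `g` matches the tree's prediction.

```python
import random

# Decision functions from a hypothetical class H, assumed closed under negation.
h1 = lambda x: x[0] > 0
h2 = lambda x: x[1] > 0
not_h1 = lambda x: not h1(x)
not_h2 = lambda x: not h2(x)

# Each leaf l: (decisions along its path, label sigma_l, weight w_l).
# The weights are positive and sum to one.
leaves = [
    ((h1, h2), +1, 0.5),
    ((h1, not_h2), -1, 0.3),
    ((not_h1,), -1, 0.2),
]


def tree_predict(x):
    """Direct evaluation of the depth-2 decision tree."""
    if h1(x):
        return +1 if h2(x) else -1
    return -1


def g(x):
    """sum_l w_l * sigma_l * (conjunction of the decisions on the path to l)."""
    return sum(w * s * all(h(x) for h in path) for path, s, w in leaves)


def sign(v):
    return 1 if v > 0 else -1


rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(500)]
agree = all(sign(g(x)) == tree_predict(x) for x in points)
print("sign(g) matches the tree on all sampled points:", agree)
```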

Exponentially Weighted Average (EWA) Algorithm on Countable Experts 

We consider here a version of the exponentially weighted experts algorithm for a countable (possibly infinite) number of experts and provide a bound on the expected regret of the randomized algorithm. The proof of the result closely follows the finite case (e.g. [11, Theorem 2.2]).

Say we are provided with countable experts $E_1, E_2, \ldots$ where each expert can herself be thought of as a randomized/deterministic player strategy which, given history, produces an element of $\mathcal{F}$ at round $t$. Here
Algorithm 3 EWA ($E_1, E_2, \ldots$; $p_1, p_2, \ldots$)

  initialize each $w_i^1 \leftarrow p_i$
  for $t = 1$ to $T$ do
    Pick randomly an expert $i$ with probability $w_i^t$
    Play $f_t = f_t^i$
    Receive $x_t$
    Update for each $i$, $w_i^{t+1} = \dfrac{w_i^t\, e^{-\eta f_t^i(x_t)}}{\sum_{j} w_j^t\, e^{-\eta f_t^j(x_t)}}$
  end for



we also assume that $\mathcal{F} \subseteq [0, 1]^{\mathcal{X}}$ contains only non-negative functions. Denote by $f_t^i$ the function output by expert $i$ at round $t$ given the history. The EWA algorithm we consider needs access to the countable set of experts and also needs an initial weighting on each expert $p_1, p_2, \ldots$ such that $\sum_i p_i = 1$.
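A minimal sketch of the EWA procedure is given below. It is not the paper's implementation: it truncates the countable expert list to a finite number `N` of experts, renormalizes the prior so that it sums to one, and takes the experts' losses $f_t^i(x_t) \in [0, 1]$ as a precomputed table `loss_rows`. The function name `ewa`, the synthetic losses, and the prior proportional to $1/i^2$ are all assumptions made for the example. Alongside the realized loss of the sampled experts, the sketch tracks the expected loss under the current weights, which can be compared with the standard guarantee of $\eta T/8 + \log(1/p_i)/\eta$ on top of the loss of expert $i$.

```python
import math
import random


def ewa(loss_rows, prior, eta, seed=0):
    """Run EWA on a finite truncation of the expert list.

    loss_rows[t][i] is the loss in [0, 1] of expert i at round t, and
    prior[i] is the initial weight p_i (assumed here to sum to 1).
    Returns (realized loss of the sampled experts, expected loss).
    """
    rng = random.Random(seed)
    weights = list(prior)
    realized = expected = 0.0
    for row in loss_rows:
        # Sample an expert from the current weights and follow its prediction.
        i = rng.choices(range(len(weights)), weights=weights)[0]
        realized += row[i]
        expected += sum(w * l for w, l in zip(weights, row))
        # Multiplicative update followed by normalization.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, row)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return realized, expected


T, N = 400, 50
gen = random.Random(1)
# Hypothetical losses: expert i has mean loss roughly 0.2 + 0.6 * i / N.
loss_rows = [
    [min(1.0, max(0.0, gen.gauss(0.2 + 0.6 * i / N, 0.1))) for i in range(N)]
    for _ in range(T)
]
raw = [1.0 / (i + 1) ** 2 for i in range(N)]
prior = [r / sum(raw) for r in raw]
eta = 1.0 / math.sqrt(T)

realized, expected = ewa(loss_rows, prior, eta)
# Guarantee relative to each expert i: its loss + eta*T/8 + log(1/p_i)/eta.
bounds = [
    sum(row[i] for row in loss_rows) + eta * T / 8 + math.log(1.0 / prior[i]) / eta
    for i in range(N)
]
print("expected loss:", expected, " smallest bound:", min(bounds))
```

Because the loss table is fixed in advance, the weights do not depend on which experts were sampled, so the tracked expected loss equals the expectation of the realized loss.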

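Proposition 34. The exponentially weighted average forecaster (Algorithm 3) with $\eta = T^{-1/2}$ yields

$$\mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) \right] \le \sum_{t=1}^{T} f_t^i(x_t) + \frac{\sqrt{T}}{8} + \sqrt{T} \log\frac{1}{p_i}$$

for any $i \in \mathbb{N}$.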

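Proof. Define $W_t = \sum_i p_i\, e^{-\eta \sum_{j=1}^{t} f_j^i(x_j)}$, and let $w^t$ denote the weights used by the algorithm on round $t$, so that $w_i^t \propto p_i\, e^{-\eta \sum_{j=1}^{t-1} f_j^i(x_j)}$. Then note that

$$\log\frac{W_t}{W_{t-1}} = \log\left( \frac{\sum_i p_i\, e^{-\eta \sum_{j=1}^{t} f_j^i(x_j)}}{\sum_i p_i\, e^{-\eta \sum_{j=1}^{t-1} f_j^i(x_j)}} \right) = \log\left( \mathbb{E}_{i \sim w^{t}}\left[ e^{-\eta f_t^i(x_t)} \right] \right).$$

Now using Hoeffding's inequality (see [11, Lemma 2.2]) we have that

$$\log\frac{W_t}{W_{t-1}} \le -\eta\, \mathbb{E}_{i \sim w^{t}}\left[ f_t^i(x_t) \right] + \frac{\eta^2}{8}.$$

Summing over $t$ we get

$$\log(W_T) - \log(W_0) = \sum_{t=1}^{T} \log\frac{W_t}{W_{t-1}} \le -\eta\, \mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) \right] + \frac{T \eta^2}{8}. \qquad (19)$$

Note that $W_0 = \sum_i p_i = 1$ and so $\log(W_0) = 0$. Also note that for any $i \in \mathbb{N}$,

$$\log(W_T) = \log\left( \sum_{j} p_j\, e^{-\eta \sum_{t=1}^{T} f_t^j(x_t)} \right) \ge \log\left( p_i\, e^{-\eta \sum_{t=1}^{T} f_t^i(x_t)} \right) = \log(p_i) - \eta \sum_{t=1}^{T} f_t^i(x_t).$$

Hence using this with Equation (19) we see that

$$\log(p_i) - \eta \sum_{t=1}^{T} f_t^i(x_t) \le -\eta\, \mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) \right] + \frac{T \eta^2}{8}.$$

Rearranging we get

$$\mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) \right] \le \sum_{t=1}^{T} f_t^i(x_t) + \frac{\eta T}{8} + \frac{1}{\eta} \log\frac{1}{p_i}.$$

Using $\eta = \frac{1}{\sqrt{T}}$ we get the desired bound.

□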
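As a worked instance of how the bound specializes, consider the weights $p_i = 6/(\pi^2 i^2)$, which sum to one because $\sum_{i \ge 1} i^{-2} = \pi^2/6$. This particular choice is an assumption made here for illustration; it is consistent with the $\log(i\pi/\sqrt{6})$ terms that appear when Proposition 34 is applied in the proofs above.

```latex
\log\frac{1}{p_i} = \log\frac{\pi^2 i^2}{6} = 2\log\frac{i\pi}{\sqrt{6}},
\qquad\text{so with } \eta = T^{-1/2}:\qquad
\frac{\eta T}{8} + \frac{1}{\eta}\log\frac{1}{p_i}
  = \frac{\sqrt{T}}{8} + 2\sqrt{T}\,\log\frac{i\pi}{\sqrt{6}}
  \le \sqrt{T}\left(1 + 2\log\frac{i\pi}{\sqrt{6}}\right).
```

The price for competing with the $i$-th expert therefore grows only logarithmically in its index $i$.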
B Conditions for the Minimax Theorem

In this section we specify conditions under which the minimax theorem employed in Theorem [T] is valid. Suppose that $\mathcal{F}$ is a subset of a complete separable metric space and $\mathcal{B}_{\mathcal{F}}$ is the $\sigma$-field of Borel subsets of $\mathcal{F}$. Let $\mathcal{Q}$ denote the set of all probability measures on $\mathcal{F}$. Similarly, let $\mathcal{P}$ be the set of all probability measures on $\mathcal{X}$.

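Under consideration is the question of conditions on $\mathcal{F}$ and $\mathcal{X}$ that guarantee

$$\inf_{q \in \mathcal{Q}} \sup_{p \in \mathcal{P}} \mathbb{E}_{f \sim q,\, x \sim p} F(f, x) = \sup_{p \in \mathcal{P}} \inf_{q \in \mathcal{Q}} \mathbb{E}_{f \sim q,\, x \sim p} F(f, x) \qquad (20)$$

for a bounded measurable function $F : \mathcal{F} \times \mathcal{X} \to \mathbb{R}$. In addressing this question, we appeal to the following result (in a slightly modified form from the original):

Theorem 35 (Theorem 3 in [35]). Let $A$ be a nonempty convex weakly compact subset of a Banach space $E$. Let $B$ be a nonempty convex subset of the dual space, $E'$, of $E$. Then

$$\inf_{a \in A} \sup_{b \in B} b(a) = \sup_{b \in B} \inf_{a \in A} b(a).$$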

To show how this theorem can be used for our purposes, write

$$g_p(q) := \mathbb{E}_{f \sim q,\, x \sim p} F(f, x) \quad \text{and} \quad G := \{ g_p : p \in \mathcal{P} \}.$$

Our desired minimax identity (20) can now be written as

$$\inf_{q \in \mathcal{Q}} \sup_{g \in G} g(q) = \sup_{g \in G} \inf_{q \in \mathcal{Q}} g(q).$$

We can view $g_p$ as a linear functional on $\mathcal{Q}$, the set of all Borel probability measures on $\mathcal{F}$. It can be verified that $\mathcal{Q}$ is a subset of a Banach space $E$ whose dual $E'$ contains the set of all bounded Borel measurable functions on $\mathcal{F}$. It is then clear that $G \subseteq E'$.

To appeal to the above theorem it is therefore enough to delineate conditions under which $\mathcal{Q}$ is weakly compact (clearly, $\mathcal{Q}$ is convex). By a fundamental result of Prohorov, weak compactness of the set of probability measures on a complete separable metric space is equivalent to uniform tightness (see e.g. [9, Theorem 8.6.2], [37]). Depending on the application, various assumptions on $\mathcal{F}$ can be imposed to guarantee uniform tightness of $\mathcal{Q}$. First, if $\mathcal{F}$ itself is compact, then $\mathcal{Q}$ is tight, and hence (20) holds. In infinite dimensions, however, compactness of $\mathcal{F}$ is a very restrictive property. Thankfully, tightness can be established under more general assumptions on $\mathcal{F}$, as we show next.

By Example 8.6.5 (ii) in [9], a family $\mathcal{Q}$ of Borel probability measures on a separable reflexive Banach space $E$ is uniformly tight (under the weak topology) precisely when there exists a function $V : E \to [0, \infty)$ continuous in the norm topology such that

$$\lim_{\|f\| \to \infty} V(f) = \infty \quad \text{and} \quad \sup_{q \in \mathcal{Q}} \mathbb{E}_{f \sim q} V(f) < \infty.$$

For instance, if $\mathcal{F}$ is a subset of a ball in $E$, it is enough to take $V(f) = \|f\|$. We conclude that (20) holds whenever $\mathcal{F}$ is a subset of a ball in a separable reflexive Banach space.